Source author record

C. Seshadhri

C. Seshadhri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Data Structures and Algorithms; Social and Information Networks; Discrete Mathematics; physics.soc-ph; Artificial Intelligence; Computational Complexity; Computational Geometry; math.CO; Computational Engineering, Finance, and Science; Distributed, Parallel, and Cluster Computing; cond-mat.dis-nn; Databases; Machine Learning; math.AC; Mathematical Software; nlin.AO; nlin.CD; nlin.CG; physics.data-an

Catalog footprint

What is connected

51 works

19 topics

4 close collaborators

preprint 2026 arXiv

Counting hypertriangles through hypergraph orientations

Counting the number of small patterns is a central task in network analysis. While this problem is well studied for graphs, many real-world datasets are naturally modeled as hypergraphs, motivating the need for efficient hypergraph motif counting algorithms. In particular, we study the problem of counting hypertriangles - collections of three pairwise-intersecting hyperedges. These hypergraph patterns have a rich structure with multiple distinct intersection patterns unlike graph triangles. Inspired by classical graph algorithms based on orientations and degeneracy, we develop a theoretical framework that generalizes these concepts to hypergraphs and yields provable algorithms for hypertriangle counting. We implement these ideas in DITCH (Degeneracy Inspired Triangle Counter for Hypergraphs) and show experimentally that it is 10-100x faster and more memory efficient than existing state-of-the-art methods.
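
To make the definition above concrete, here is a minimal brute-force sketch that counts unordered triples of pairwise-intersecting hyperedges. It is not DITCH and uses no orientation or degeneracy ideas; the function name `count_hypertriangles` and the input format (a list of vertex collections, each list entry treated as a distinct hyperedge) are assumptions made for illustration.

```python
from itertools import combinations


def count_hypertriangles(hyperedges):
    """Count unordered triples of hyperedges that pairwise intersect.

    Illustrative brute force only: it inspects every triple, so it is
    cubic in the number of hyperedges.
    """
    sets = [frozenset(e) for e in hyperedges]
    count = 0
    for a, b, c in combinations(sets, 3):
        # All three pairwise intersections must be non-empty.
        if a & b and b & c and a & c:
            count += 1
    return count
```

For example, the three hyperedges {1, 2}, {2, 3}, {1, 3} form one hypertriangle whose intersections are all distinct vertices, while {1, 2, 3}, {3, 4}, {3, 5} form one whose hyperedges all share vertex 3. These are two of the distinct intersection patterns the abstract alludes to.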

preprint 2022 arXiv

Classic Graph Structural Features Outperform Factorization-Based Graph Embedding Methods on Community Labeling

Graph representation learning (also called graph embeddings) is a popular technique for incorporating network structure into machine learning models. Unsupervised graph embedding methods aim to capture graph structure by learning a low-dimensional vector representation (the embedding) for each node. Despite the widespread use of these embeddings for a variety of downstream transductive machine learning tasks, there is little principled analysis of the effectiveness of this approach for common tasks. In this work, we provide an empirical and theoretical analysis for the performance of a class of embeddings on the common task of pairwise community labeling. This is a binary variant of the classic community detection problem, which seeks to build a classifier to determine whether a pair of vertices participate in a community. In line with our goal of foundational understanding, we focus on a popular class of unsupervised embedding techniques that learn low rank factorizations of a vertex proximity matrix (this class includes methods like GraRep, DeepWalk, node2vec, NetMF). We perform detailed empirical analysis for community labeling over a variety of real and synthetic graphs with ground truth. In all cases we studied, the models trained from embedding features perform poorly on community labeling. In contrast, a simple logistic model with classic graph structural features handily outperforms the embedding models. For a more principled understanding, we provide a theoretical analysis for the (in)effectiveness of these embeddings in capturing the community structure. We formally prove that popular low-dimensional factorization methods either cannot produce community structure, or can only produce "unstable" communities. These communities are inherently unstable under small perturbations.

preprint 2022 arXiv

Randomized Algorithms for Scientific Computing (RASC)

Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.

preprint 2020 arXiv

Distribution-Free Models of Social Networks

The structure of large-scale social networks has predominantly been articulated using generative models, a form of average-case analysis. This chapter surveys recent proposals of more robust models of such networks. These models posit deterministic and empirically supported combinatorial structure rather than a specific probability distribution. We discuss the formal definitions of these models and how they relate to empirical observations in social networks, as well as the known structural and algorithmic results for the corresponding graph classes.

preprint 2020 arXiv

How the Degeneracy Helps for Triangle Counting in Graph Streams

We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$-approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $Θ(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $κ$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(mκ/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $Ω(mκ/T)$.
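
As background for the degeneracy parameter $κ$, here is a minimal sketch of the standard peeling procedure: repeatedly remove a vertex of minimum remaining degree, and report the largest degree seen at removal time. This is not the streaming algorithm from the abstract. The function name `degeneracy` and the adjacency-dictionary input are assumptions for illustration, and the simple minimum search makes this version quadratic rather than linear time.

```python
def degeneracy(adj):
    """Return (kappa, removal_order) for an undirected simple graph.

    adj maps each vertex to the set of its neighbors.
    """
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    kappa = 0
    order = []
    while remaining:
        # Peel a vertex with the smallest degree among remaining vertices.
        v = min(remaining, key=degree.__getitem__)
        kappa = max(kappa, degree[v])
        order.append(v)
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return kappa, order
```

A path or a star has degeneracy 1 no matter how large its maximum degree is, which illustrates why low degeneracy is a much weaker requirement than low maximum degree.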

preprint 2020 arXiv

How to Count Triangles, without Seeing the Whole Graph

Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph? We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model. Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.

preprint 2020 arXiv

Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS

Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, all the more so since the search space for near-cliques is orders of magnitude larger than that of cliques. We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges, in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x - 100x speedup over existing algorithms for counting near-cliques. Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow.

preprint 2020 arXiv

The impossibility of low rank representations for triangle-rich complex networks

The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have rank nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as Singular Value Decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.

preprint 2020 arXiv

The Power of Pivoting for Exact Clique Counting

Clique counting is a fundamental task in network analysis, and even the simplest setting of $3$-cliques (triangles) has been the center of much recent research. Getting the count of $k$-cliques for larger $k$ is algorithmically challenging, due to the exponential blowup in the search space of large cliques. But a number of recent applications (especially for community detection or clustering) use larger clique counts. Moreover, one often desires \textit{local} counts, the number of $k$-cliques per vertex/edge. Our main result is Pivoter, an algorithm that exactly counts the number of $k$-cliques, \textit{for all values of $k$}. It is surprisingly effective in practice, and is able to get clique counts of graphs that were beyond the reach of previous work. For example, Pivoter gets all clique counts in a social network with 100M edges within two hours on a commodity machine. Previous parallel algorithms do not terminate in days. Pivoter can also feasibly get local per-vertex and per-edge $k$-clique counts (for all $k$) for many public data sets with tens of millions of edges. To the best of our knowledge, this is the first algorithm that achieves such results. The main insight is the construction of a Succinct Clique Tree (SCT) that stores a compressed unique representation of all cliques in an input graph. It is built using a technique called \textit{pivoting}, a classic approach by Bron-Kerbosch to reduce the recursion tree of backtracking algorithms for maximal cliques. Remarkably, the SCT can be built without actually enumerating all cliques, and provides a succinct data structure from which exact clique statistics ($k$-clique counts, local counts) can be read off efficiently.

preprint 2016 arXiv

A $\widetilde{O}(n)$ Non-Adaptive Tester for Unateness

Khot and Shinkar (RANDOM, 2016) recently described an adaptive, $O(n \log(n)/\varepsilon)$-query tester for unateness of Boolean functions $f:\{0,1\}^n \to \{0,1\}$. In this note we describe a simple non-adaptive, $O(n \log(n/\varepsilon)/\varepsilon)$-query tester for unateness for functions over the hypercube with any ordered range.

preprint 2016 arXiv

ESCAPE: Efficiently Counting All 5-Vertex Subgraphs

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$-vertex patterns. This algorithm is typically ten times faster than the state of the art $4$-vertex counters.

preprint 2015 arXiv

A simpler sublinear algorithm for approximating the triangle count

A recent result of Eden, Levi, and Ron (ECCC 2015) provides a sublinear time algorithm to estimate the number of triangles in a graph. Given an undirected graph $G$, one can query the degree of a vertex, the existence of an edge between vertices, and the $i$th neighbor of a vertex. Suppose the graph has $n$ vertices, $m$ edges, and $t$ triangles. In this model, Eden et al provided a $O(\poly(\eps^{-1}\log n)(n/t^{1/3} + m^{3/2}/t))$ time algorithm to get a $(1+\eps)$-multiplicative approximation for $t$, the triangle count. This paper provides a simpler algorithm with the same running time (up to differences in the $\poly(\eps^{-1}\log n)$ factor) that has a substantially simpler analysis.

preprint 2015 arXiv

Approximately Counting Triangles in Sublinear Time

We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a {\em sublinear-time\/} algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries. We show that for any given approximation parameter $0<ε<1$, the algorithm provides an estimate $\widehat{t}$ such that with high constant probability, $(1-ε)\cdot t< \widehat{t}<(1+ε)\cdot t$, where $t$ is the number of triangles in the graph $G$. The expected query complexity of the algorithm is $\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)\cdot {\rm poly}(\log n, 1/ε)$, where $n$ is the number of vertices in the graph and $m$ is the number of edges, and the expected running time is $\!\left(\frac{n}{t^{1/3}} + \frac{m^{3/2}}{t}\right)\cdot {\rm poly}(\log n, 1/ε)$. We also prove that $Ω\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)$ queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in $n$ (and the dependence on $1/ε$).

preprint 2015 arXiv

Avoiding the Global Sort: A Faster Contour Tree Algorithm

We revisit the classical problem of computing the \emph{contour tree} of a scalar field $f:\mathbb{M} \to \mathbb{R}$, where $\mathbb{M}$ is a triangulated simplicial mesh in $\mathbb{R}^d$. The contour tree is a fundamental topological structure that tracks the evolution of level sets of $f$ and has numerous applications in data analysis and visualization. All existing algorithms begin with a global sort of at least all critical values of $f$, which can require (roughly) $Ω(n\log n)$ time. Existing lower bounds show that there are pathological instances where this sort is required. We present the first algorithm whose time complexity depends on the contour tree structure, and avoids the global sort for non-pathological inputs. If $C$ denotes the set of critical points in $\mathbb{M}$, the running time is roughly $O(\sum_{v \in C} \log \ell_v)$, where $\ell_v$ is the depth of $v$ in the contour tree. This matches all existing upper bounds, but is a significant improvement when the contour tree is short and fat. Specifically, our approach ensures that any comparison made is between nodes in the same descending path in the contour tree, allowing us to argue strong optimality properties of our algorithm. Our algorithm requires several novel ideas: partitioning $\mathbb{M}$ in well-behaved portions, a local growing procedure to iteratively build contour trees, and the use of heavy path decompositions for the time complexity analysis.

preprint 2015 arXiv

Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.

preprint 2015 arXiv

Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions

Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.

preprint 2015 arXiv

Trigger detection for adaptive scientific workflows using percentile sampling

The increasing complexity of scientific simulations and HPC architectures is driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for post-analysis; however, we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase, and adapt the workflow accordingly. We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation time and is thus prohibitively expensive. To overcome this bottleneck, we propose a quantile-sampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. The proposed framework is generalizable and we detail how it could be applied to a broad class of scientific simulation workflows.

preprint 2014 arXiv

A o(n) monotonicity tester for Boolean functions over the hypercube

A Boolean function $f:\{0,1\}^n \mapsto \{0,1\}$ is said to be $\eps$-far from monotone if $f$ needs to be modified in at least $\eps$-fraction of the points to make it monotone. We design a randomized tester that is given oracle access to $f$ and an input parameter $\eps>0$, and has the following guarantee: It outputs {\sf Yes} if the function is monotonically non-decreasing, and outputs {\sf No} with probability $>2/3$, if the function is $\eps$-far from monotone. This non-adaptive, one-sided tester makes $O(n^{7/8}\eps^{-3/2}\ln(1/\eps))$ queries to the oracle.

preprint 2014 arXiv

Characterizing short-term stability for Boolean networks over any distribution of transfer functions

We present a characterization of short-term stability of random Boolean networks under \emph{arbitrary} distributions of transfer functions. Given any distribution of transfer functions for a random Boolean network, we present a formula that decides whether short-term chaos (damage spreading) will happen. We provide a formal proof for this formula, and empirically show that its predictions are accurate. Previous work only works for special cases of balanced families. It has been observed that these characterizations fail for unbalanced families, yet such families are widespread in real biological networks.

preprint 2014 arXiv

Counting Triangles in Real-World Graph Streams: Dealing with Repeated Edges and Time Windows

Real-world graphs often manifest as a massive temporal stream of edges. The need for real-time analysis of such large graph streams has led to progress on low memory, one-pass streaming graph algorithms. These algorithms were designed for simple graphs, assuming an edge is not repeated in the stream. Real graph streams, however, are almost always multigraphs, i.e., they contain many duplicate edges. The assumption of no repeated edges requires an extra pass *storing all the edges* just for deduplication, which defeats the purpose of small memory algorithms. We describe an algorithm for estimating the triangle count of a multigraph stream of edges. We show that all previous streaming algorithms for triangle counting fail for multigraph streams, despite their impressive accuracies for simple graphs. The bias created by duplicate edges is a major problem, and leads these algorithms astray. Our algorithm avoids these biases through careful debiasing strategies and has provable theoretical guarantees and excellent empirical performance. Our algorithm builds on the previously introduced wedge sampling methodology. Another challenge in analyzing temporal graphs is finding the right temporal window size. Our algorithm seamlessly handles multiple time windows, and does not require committing to any window size(s) a priori. We apply our algorithm to discover fascinating transitivity and triangle trends in real-world graph streams.

preprint 2014 arXiv

Decompositions of Triangle-Dense Graphs

High triangle density -- the graph property stating that a constant fraction of two-hop paths belong to a triangle -- is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove constructively that significant portions of a triangle-dense graph are contained in a disjoint union of dense, radius 2 subgraphs. This result quantifies the extent to which triangle-dense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in approximation-stable k-median instances.

preprint 2014 arXiv

Directed closure measures for networks with reciprocity

The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the \emph{transitivity}, i.e., the fraction of paths of length $2$ that participate in triangles. Real-world networks are often directed, and it can be difficult to "measure" this network structure meaningfully. We propose a collection of \emph{directed closure values} for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.

preprint 2014 arXiv

FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs

We propose a new algorithm, FAST-PPR, for estimating personalized PageRank: given start node $s$ and target node $t$ in a directed graph, and given a threshold $δ$, FAST-PPR estimates the Personalized PageRank $π_s(t)$ from $s$ to $t$, guaranteeing a small relative error as long as $π_s(t)>δ$. Existing algorithms for this problem have a running-time of $Ω(1/δ)$; in comparison, FAST-PPR has a provable average running-time guarantee of ${O}(\sqrt{d/δ})$ (where $d$ is the average in-degree of the graph). This is a significant improvement, since $δ$ is often $O(1/n)$ (where $n$ is the number of nodes) for applications. We also complement the algorithm with an $Ω(1/\sqrt{δ})$ lower bound for PageRank estimation, showing that the dependence on $δ$ cannot be improved. We perform a detailed empirical study on numerous massive graphs, showing that FAST-PPR dramatically outperforms existing algorithms. For example, on the 2010 Twitter graph with 1.5 billion edges, for target nodes sampled by popularity, FAST-PPR has a $20$ factor speedup over the state of the art. Furthermore, an enhanced version of FAST-PPR has a $160$ factor speedup on the Twitter graph, and is at least $20$ times faster on all our candidate graphs.

preprint 2014 arXiv

Optimal bounds for monotonicity and Lipschitz testing over hypercubes and hypergrids

The problem of monotonicity testing over the hypergrid and its special case, the hypercube, is a classic, well-studied, yet unsolved question in property testing. We are given query access to $f:[k]^n \mapsto \R$ (for some ordered range $\R$). The hypergrid/cube has a natural partial order given by coordinate-wise ordering, denoted by $\prec$. A function is \emph{monotone} if for all pairs $x \prec y$, $f(x) \leq f(y)$. The distance to monotonicity, $\eps_f$, is the minimum fraction of values of $f$ that need to be changed to make $f$ monotone. For $k=2$ (the boolean hypercube), the usual tester is the \emph{edge tester}, which checks monotonicity on adjacent pairs of domain points. It is known that the edge tester using $O(\eps^{-1}n\log|\R|)$ samples can distinguish a monotone function from one where $\eps_f > \eps$. On the other hand, the best lower bound for monotonicity testing over the hypercube is $\min(|\R|^2,n)$. This leaves a quadratic gap in our knowledge, since $|\R|$ can be $2^n$. We resolve this long standing open problem and prove that $O(n/\eps)$ samples suffice for the edge tester. For hypergrids, known testers require $O(\eps^{-1}n\log k\log |\R|)$ samples, while the best known (non-adaptive) lower bound is $Ω(\eps^{-1} n\log k)$. We give a (non-adaptive) monotonicity tester for hypergrids running in $O(\eps^{-1} n\log k)$ time. Our techniques lead to optimal property testers (with the same running time) for the natural \emph{Lipschitz property} on hypercubes and hypergrids. (A $c$-Lipschitz function is one where $|f(x) - f(y)| \leq c\|x-y\|_1$.) In fact, we give a general unified proof for $O(\eps^{-1}n\log k)$-query testers for a class of "bounded-derivative" properties, a class containing both monotonicity and Lipschitz.
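
The edge tester described above for the hypercube case ($k=2$) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `edge_tester`, the tuple representation of points, and the caller-supplied `num_samples` are assumptions. The abstract states that $O(n/\eps)$ samples suffice but gives no constant, so none is hard-coded here.

```python
import random


def edge_tester(f, n, num_samples, rng=None):
    """One-sided edge tester for monotonicity of f on {0,1}^n.

    f takes a tuple of n bits and returns a value from an ordered range.
    Returns False as soon as a violated hypercube edge is found, and True
    if every sampled edge is consistent with monotonicity.
    """
    rng = rng or random.Random()
    for _ in range(num_samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        i = rng.randrange(n)  # dimension along which the edge runs
        lo = tuple(x[:i] + [0] + x[i + 1:])
        hi = tuple(x[:i] + [1] + x[i + 1:])
        # lo precedes hi in the coordinate-wise order, so a monotone f
        # must satisfy f(lo) <= f(hi).
        if f(lo) > f(hi):
            return False
    return True
```

Because the tester rejects only on an explicit violation, a monotone function such as `sum` is always accepted. A function such as `lambda x: -sum(x)`, which violates every edge, is rejected by the first sample.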

preprint 2014 arXiv

Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.

preprint2014arXiv

Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties

The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion is the Hamming distance over the {\em uniform} distribution on the domain. This restriction to uniformity is more a matter of convenience than of necessity, and it is important to investigate distances induced by more general distributions. In this paper, we make significant strides in this direction. We give simple and optimal testers for {\em bounded derivative properties} over {\em arbitrary product distributions}. Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity. Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing. We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of expected depths of optimal BSTs for each marginal. Furthermore, we show this sum-of-depths is also a lower bound. A fundamental technical contribution of this work is an {\em optimal dimension reduction theorem} for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity for the past 15 years, and our theorem is an exponential improvement to the previous best known result.

preprint2014arXiv

Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately-sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
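
Illustrative sketch (not taken from the paper): a wedge is a path of length 2, and the global clustering coefficient is the fraction of wedges that are closed into a triangle. The Python below samples wedges uniformly and uses the closed fraction to estimate both the global clustering coefficient and the triangle count. The function name `wedge_sample_transitivity` and the adjacency format (a dict mapping each node to a set of neighbors) are assumptions made for this example.

```python
import random

def wedge_sample_transitivity(adj, num_samples, rng=None):
    """Estimate (global clustering coefficient, triangle count) by wedge sampling.

    adj -- dict: node -> set of neighbors, undirected simple graph (assumed format)
    """
    rng = rng or random.Random()
    # A node of degree d is the center of d*(d-1)/2 wedges.
    centers = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    total_wedges = sum(weights)
    if total_wedges == 0:
        return 0.0, 0.0
    closed = 0
    # Choosing a center proportionally to its wedge count, then a uniform
    # pair of its neighbors, yields a uniformly random wedge.
    for v in rng.choices(centers, weights=weights, k=num_samples):
        a, b = rng.sample(list(adj[v]), 2)
        if b in adj[a]:
            closed += 1
    kappa = closed / num_samples
    # Each triangle closes exactly three wedges.
    return kappa, kappa * total_wedges / 3
```

The number of samples controls the accuracy and does not depend on the graph size, which is what makes the approach attractive for large graphs.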

preprint2014arXiv

Why do simple algorithms for triangle enumeration work in the real world?

Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in real-world graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly super-linear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length $2$, and checks if each such path is incident to a triangle. A good heuristic is to enumerate only those paths of length $2$ where the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavy-tailed degree distributions, a defining feature of real-world graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the $\ell_{4/3}$-norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows non-trivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected linear-time performance for degree sequences following a power law with exponent $α\geq 7/3$, and non-trivial speedup whenever $α\in (2,3)$.
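
Illustrative sketch (not taken from the paper): the heuristic described above enumerates only those paths of length 2 whose middle vertex has the lowest degree of the three. A minimal Python version follows; the function name `list_triangles`, the adjacency format (dict of node to neighbor set), and the tie-breaking by insertion order are assumptions made for this example.

```python
from itertools import combinations

def list_triangles(adj):
    """List each triangle once, from its lowest-degree vertex.

    adj -- dict: node -> set of neighbors, undirected simple graph (assumed format)
    """
    # Rank vertices by degree; the position index breaks ties consistently.
    rank = {v: (len(adj[v]), i) for i, v in enumerate(adj)}
    triangles = []
    for v in adj:
        # Keep only neighbors ranked above v, so v is the lowest-ranked
        # vertex (the middle of the path) in every pair examined.
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                triangles.append((v, u, w))
    return triangles
```

High-degree vertices have few higher-ranked neighbors, so they contribute few pairs; this is the source of the speedup over enumerating all paths of length 2.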

preprint2013arXiv

A Scalable Generative Graph Model with Community Structure

Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdös-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.

preprint2013arXiv

A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal

Degree distributions are arguably the most important property of real world networks. The classic edge configuration model or Chung-Lu model can generate an undirected graph with any desired degree distribution. This serves as a good null model to compare algorithms or perform experimental studies. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the real-world are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (one-way edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the Chung-Lu model that matches the in-, out-, and reciprocal-degree distributions of the real graphs. Our algorithm is scalable and requires $O(m)$ random numbers to generate a graph with $m$ edges. We perform a series of experiments on real datasets and compare with existing graph models.

preprint2013arXiv

A space efficient streaming algorithm for triangle counting using the birthday paradox

We design a space efficient algorithm that approximates the transitivity (global clustering coefficient) and total triangle count with only a single pass through a graph given as a stream of edges. Our procedure is based on the classic probabilistic result, the birthday paradox. When the transitivity is constant and there are more edges than wedges (common properties for social networks), we can prove that our algorithm requires $O(\sqrt{n})$ space ($n$ is the number of vertices) to provide accurate estimates. We run a detailed set of experiments on a variety of real graphs and demonstrate that the memory requirement of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 60,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains a real-time estimate of the transitivity/number of triangles of a graph, by storing a minuscule fraction of edges.

preprint2013arXiv

An In-Depth Analysis of Stochastic Kronecker Graphs

Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.

preprint2013arXiv

An optimal lower bound for monotonicity testing over hypergrids

For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinate-wise product partial ordering denoted by $\prec$. A function $f: [n]^d \mapsto \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\eps$-far from monotone if at least an $\eps$-fraction of values must be changed to make $f$ monotone. Given a parameter $\eps$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\eps$-far. We prove that any (adaptive, two-sided) monotonicity tester for functions $f:[n]^d \mapsto \mathbb{N}$ must make $Ω(\eps^{-1}d\log n - \eps^{-1}\log \eps^{-1})$ queries. Recent upper bounds show the existence of $O(\eps^{-1}d \log n)$ query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a non-adaptive bound of $Ω(d \log n)$.

preprint2013arXiv

Counting Triangles in Massive Graphs with MapReduce

Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 seconds per million edges plus overhead (approximately 225 seconds total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.

preprint2013arXiv

Estimating the longest increasing sequence in polylogarithmic time

Finding the length of the longest increasing subsequence (LIS) is a classic algorithmic problem. Let $n$ denote the size of the array. Simple $O(n\log n)$ algorithms are known for this problem. We develop a polylogarithmic time randomized algorithm that for any constant $δ> 0$, estimates the length of the LIS of an array to within an additive error of $δn$. More precisely, the running time of the algorithm is $(\log n)^c (1/δ)^{O(1/δ)}$ where the exponent $c$ is independent of $δ$. Previously, the best known polylogarithmic time algorithms could only achieve an additive $n/2$ approximation. With a suitable choice of parameters, our algorithm also gives, for any fixed $τ>0$, a multiplicative $(1+τ)$-approximation to the distance to monotonicity $\varepsilon_f$ (the fraction of entries not in the LIS), whose running time is polynomial in $\log(n)$ and $1/\varepsilon_f$. The best previously known algorithm could only guarantee an approximation within a factor (arbitrarily close to) 2.

preprint2013arXiv

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $δ> 0$, given streaming access to an array of length $n$ provides a $(1+δ)$-multiplicative approximation to the \emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((\log^2 n)/δ)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $δn$ for any $δ>0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by \cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$δn$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(\log^2 n)/δ)$, where $k$ is the maximum number of times any one symbol appears in $y$.
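
Illustrative sketch (not taken from the paper): the quantity being approximated, the distance to monotonicity ($n$ minus the length of the LIS), can be computed exactly in one left-to-right pass with the classic $O(n\log n)$ method. The Python below is that exact baseline, which may store up to $n$ values; it is not the paper's polylogarithmic-space algorithm. The function name `distance_to_monotonicity` and the choice to treat "increasing" as nondecreasing are assumptions made for this example.

```python
from bisect import bisect_right

def distance_to_monotonicity(arr):
    """Return len(arr) minus the length of a longest nondecreasing subsequence."""
    # tails[k] holds the smallest possible last value of a nondecreasing
    # subsequence of length k + 1 among the elements seen so far.
    tails = []
    for x in arr:
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[pos] = x    # x gives a smaller tail for length pos + 1
    return len(arr) - len(tails)
```

For example, `distance_to_monotonicity([1, 3, 2, 4])` returns 1, since deleting either 3 or 2 leaves a sorted array.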

preprint2012arXiv

A stopping criterion for Markov chains when generating independent random graphs

Markov chains are a convenient means of generating realizations of networks with a given (joint or otherwise) degree distribution, since they simply require a procedure for rewiring edges. The major challenge is to find the right number of steps to run such a chain, so that we generate truly independent samples. Theoretical bounds for mixing times of these Markov chains are too large to be practically useful. Practitioners have no useful guide for choosing the length, and tend to pick numbers fairly arbitrarily. We give a principled mathematical argument showing that it suffices for the length to be proportional to the desired number of edges. We also prescribe a method for choosing this proportionality constant. We run a series of experiments showing that the distributions of common graph properties converge in this time, providing empirical evidence for our claims.
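
Illustrative sketch (not taken from the paper): a common rewiring procedure of this kind is the double edge swap, which replaces edges (a, b) and (c, d) with (a, d) and (c, b) and so preserves every vertex degree. The Python below runs such a chain for a number of steps proportional to the number of edges. The function name `rewire` and the parameter `swaps_per_edge` (the proportionality constant, left to the caller) are assumptions made for this example; the sketch preserves the plain degree sequence only, not a joint degree distribution.

```python
import random

def rewire(edges, swaps_per_edge, rng=None):
    """Run a degree-preserving double-edge-swap chain on a simple graph.

    edges          -- list of (u, v) pairs, no self-loops or duplicates (assumed)
    swaps_per_edge -- hypothetical proportionality constant for the chain length
    """
    rng = rng or random.Random()
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    steps = int(swaps_per_edge * len(edge_list))
    for _ in range(steps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a parallel edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list
```

Rejected proposals still count as steps, so the chain length is fixed in advance at `swaps_per_edge` times the number of edges.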

preprint2012arXiv

Are we there yet? When to stop a Markov chain while generating random graphs

Markov chains are a convenient means of generating realizations of networks, since they require little more than a procedure for rewiring edges. If a rewiring procedure exists for generating new graphs with specified statistical properties, then a Markov chain sampler can generate an ensemble of graphs with prescribed characteristics. However, successive graphs in a Markov chain cannot be used when one desires independent draws from the distribution of graphs; the realizations are correlated. Consequently, one runs a Markov chain for N iterations before accepting the realization as an independent sample. In this work, we devise two methods for calculating N. They are both based on the binary "time-series" denoting the occurrence/non-occurrence of edge (u, v) between vertices u and v in the Markov chain of graphs generated by the sampler. They differ in their underlying assumptions. We test them on the generation of graphs with a prescribed joint degree distribution. We find that N is proportional to |E|, where |E| is the number of edges in the graph. The two methods are compared by sampling on real, sparse graphs with 10^3 - 10^4 vertices.

preprint2012arXiv

Degree Relations of Triangles in Real-world Networks and Models

Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of degree-labeled triangles -- specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1% of degrees participate in the vast majority of triangles in heterogeneous graphs. We also ask whether current graph models reproduce the types of triangles that are observed in real data, and show that most models fail to accurately capture these salient features.

preprint2012arXiv

Finding Cycles and Trees in Sublinear Time

We present sublinear-time (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and tree-minors in bounded-degree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$-minor-free (resp., free from having the corresponding tree-minor). In particular, if the graph is far (i.e., $Ω(1)$-far) from being cycle-free, i.e. if one has to delete a constant fraction of edges to make it cycle-free, then the algorithm finds a cycle of polylogarithmic length in time $\tilde{O}(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors. The foregoing results are the outcome of our study of the complexity of {\em one-sided error} property testing algorithms in the bounded-degree graphs model. For example, we show that cycle-freeness of $N$-vertex graphs can be tested with one-sided error within time complexity $\tilde{O}(\mathrm{poly}(1/\epsilon)\cdot\sqrt{N})$. This matches the known $Ω(\sqrt{N})$ query lower bound, and contrasts with the fact that any minor-free property admits a {\em two-sided error} tester of query complexity that only depends on the proximity parameter $\epsilon$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least $k$. On the other hand, for any fixed tree $T$, we show that $T$-minor-freeness has a one-sided error tester of query complexity that only depends on the proximity parameter $\epsilon$. Our algorithm for finding cycles in bounded-degree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding tree-minors in $o(\sqrt{N})$ complexity.

preprint2012arXiv

Self-improving Algorithms for Coordinate-wise Maxima

Computing the coordinate-wise maxima of a planar point set is a classic and well-studied problem in computational geometry. We give an algorithm for this problem in the \emph{self-improving setting}. We have $n$ (unknown) independent distributions $\cD_1, \cD_2, ..., \cD_n$ of planar points. An input pointset $(p_1, p_2, ..., p_n)$ is generated by taking an independent sample $p_i$ from each $\cD_i$, so the input distribution $\cD$ is the product $\prod_i \cD_i$. A self-improving algorithm repeatedly gets input sets from the distribution $\cD$ (which is \emph{a priori} unknown) and tries to optimize its running time for $\cD$. Our algorithm uses the first few inputs to learn salient features of the distribution, and then becomes an optimal algorithm for distribution $\cD$. Let $\OPT_\cD$ denote the expected depth of an \emph{optimal} linear comparison tree computing the maxima for distribution $\cD$. Our algorithm eventually has an expected running time of $O(\text{OPT}_\cD + n)$, even though it did not know $\cD$ to begin with. Our result requires new tools to understand linear comparison trees for computing maxima. We show how to convert general linear comparison trees to very restricted versions, which can then be related to the running time of our algorithm. An interesting feature of our algorithm is an interleaved search, where the algorithm tries to determine the likeliest point to be maximal with minimal computation. This allows the running time to be truly optimal for the distribution $\cD$.

preprint2012arXiv

Triadic Measures on Graphs: The Power of Wedge Sampling

Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We propose a new method based on wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.

preprint2011arXiv

Community structure and scale-free collections of Erdös-Rényi graphs

Community structure plays a significant role in the analysis of social networks and similar graphs, yet this structure is little understood and not well captured by most models. We formally define a community to be a subgraph that is internally highly connected and has no deeper substructure. We use tools of combinatorics to show that any such community must contain a dense Erdös-Rényi (ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavy-tailed degree distribution and community structure must contain a scale free collection of dense ER subgraphs. These theoretical observations corroborate well with empirical evidence. From this, we propose the Block Two-Level Erdös-Rényi (BTER) model, and demonstrate that it accurately captures the observable properties of many real-world social networks.

preprint2011arXiv

Influence and Dynamic Behavior in Random Boolean Networks

We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network models. We use this framework to provide the first formal proof of many of the standard critical transition results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean networks. We precisely connect the short-run dynamic behavior of a Boolean network to the average influence of the transfer functions. We show that some of the assumptions traditionally made in the more common mean-field analysis of Boolean networks do not hold in general. For example, we offer some evidence that imbalance, or expected internal inhomogeneity, of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than previously observed.

preprint2011arXiv

Neighborhoods are good communities

The communities of a social network are sets of vertices with more connections inside the set than outside. We theoretically demonstrate that two commonly observed properties of social networks, heavy-tailed degree distributions and large clustering coefficients, imply the existence of vertex neighborhoods (also known as egonets) that are themselves good communities. We evaluate these neighborhood communities on a range of graphs. What we find is that the neighborhood communities often exhibit conductance scores that are as good as the Fiedler cut. Also, the conductance of neighborhood communities shows similar behavior as the network community profile computed with a personalized PageRank community detection method. The latter requires sweeping over a great many starting vertices, which can be expensive. By using a small and easy-to-compute set of neighborhood communities as seeds for these PageRank communities, however, we find communities that precisely capture the behavior of the network community profile when seeded everywhere in the graph, and at a significant reduction in total work.

preprint2011arXiv

The Similarity between Stochastic Kronecker and Chung-Lu Graph Models

The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model. We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.

preprint2010arXiv

Blackbox identity testing for bounded top fanin depth-3 circuits: the field doesn't matter

Let C be a depth-3 circuit with n variables, degree d and top fanin k (called sps(k,d,n) circuits) over base field F. It is a major open problem to design a deterministic polynomial time blackbox algorithm that tests if C is identically zero. Klivans & Spielman (STOC 2001) observed that the problem is open even when k is a constant. This case has been subjected to a serious study over the past few years, starting from the work of Dvir & Shpilka (STOC 2005). We give the first polynomial time blackbox algorithm for this problem. Our algorithm runs in time poly(nd^k), regardless of the base field. The only field for which polynomial time algorithms were previously known is F=Q (Kayal & Saraf, FOCS 2009, and Saxena & Seshadhri, FOCS 2010). This is the first blackbox algorithm for depth-3 circuits that does not use the rank based approaches of Karnin & Shpilka (CCC 2008). We prove an important tool for the study of depth-3 identities. We design a blackbox polynomial time transformation that reduces the number of variables in a sps(k,d,n) circuit to k variables, but preserves the identity structure.

preprint2010arXiv

Combinatorial Approximation Algorithms for MaxCut using Random Walks

We give the first combinatorial approximation algorithm for Maxcut that beats the trivial 0.5 factor by a constant. The main partitioning procedure is very intuitive, natural, and easily described. It essentially performs a number of random walks and aggregates the information to provide the partition. We can control the running time to get an approximation factor-running time tradeoff. We show that for any constant b > 1.5, there is an O(n^{b}) algorithm that outputs a (0.5+delta)-approximation for Maxcut, where delta = delta(b) is some positive constant. One of the components of our algorithm is a weak local graph partitioning procedure that may be of independent interest. Given a starting vertex $i$ and a conductance parameter phi, unless a random walk of length ell = O(log n) starting from i mixes rapidly (in terms of phi and ell), we can find a cut of conductance at most phi close to the vertex. The work done per vertex found in the cut is sublinear in n.

preprint2010arXiv

From Sylvester-Gallai Configurations to Rank Bounds: Improved Black-box Identity Test for Depth-3 Circuits

We study the problem of identity testing for depth-3 circuits of top fanin k and degree d. We give a new structure theorem for such identities. A direct application of our theorem improves the known deterministic d^{k^k}-time black-box identity test over rationals (Kayal-Saraf, FOCS 2009) to one that takes d^{k^2}-time. Our structure theorem essentially says that the number of independent variables in a real depth-3 identity is very small. This theorem settles affirmatively the stronger rank conjectures posed by Dvir-Shpilka (STOC 2005) and Kayal-Saraf (FOCS 2009). Our techniques provide a unified framework that actually beats all known rank bounds and hence gives the best running time (for every field) for black-box identity tests. Our main theorem (almost optimally) pins down the relation between higher dimensional Sylvester-Gallai theorems and the rank of depth-3 identities in a very transparent manner. The existence of this was hinted at by Dvir-Shpilka (STOC 2005), but first proven, for reals, by Kayal-Saraf (FOCS 2009). We introduce the concept of Sylvester-Gallai rank bounds for any field, and show the intimate connection between this and depth-3 identity rank bounds. We also prove the first ever theorem about high dimensional Sylvester-Gallai configurations over any field. Our proofs and techniques are very different from previous results and devise a very interesting ensemble of combinatorics and algebra. The latter concepts are ideal theoretic and involve a new Chinese remainder theorem. Our proof methods explain the structure of any depth-3 identity C: there is a nucleus of C that forms a low rank identity, while the remainder is a high dimensional Sylvester-Gallai configuration.

preprint2010arXiv

Is submodularity testable?

We initiate the study of property testing of submodularity on the boolean hypercube. Submodular functions come up in a variety of applications in combinatorial optimization. For a vast range of algorithms, the existence of an oracle to a submodular function is assumed. But how does one check if this oracle indeed represents a submodular function? Consider a function f:{0,1}^n \rightarrow R. The distance to submodularity is the minimum fraction of values of $f$ that need to be modified to make f submodular. If this distance is more than epsilon > 0, then we say that f is epsilon-far from being submodular. The aim is to have an efficient procedure that, given input f that is epsilon-far from being submodular, certifies that f is not submodular. We analyze a very natural tester for this problem, and prove that it runs in subexponential time. This gives the first non-trivial tester for submodularity. On the other hand, we prove an interesting lower bound (that is, unfortunately, quite far from the upper bound) suggesting that this tester cannot be very efficient in terms of epsilon. This involves non-trivial examples of functions which are far from submodular and yet do not exhibit too many local violations. We also provide some constructions indicating the difficulty in designing a tester for submodularity. We construct a partial function defined on exponentially many points that cannot be extended to a submodular function, but any strict subset of these values can be extended to a submodular function.

preprint2010arXiv

Self-Improving Algorithms

We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an unknown input distribution D. We assume here that D is of product type. More precisely, suppose that we need to process a sequence I_1, I_2, ... of inputs I = (x_1, x_2, ..., x_n) of some fixed length n, where each x_i is drawn independently from some arbitrary, unknown distribution D_i. The goal is to design an algorithm for these inputs so that eventually the expected running time will be optimal for the input distribution D = D_1 * D_2 * ... * D_n. We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.
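
The two-phase structure described above (a training phase, then a stationary regime) can be illustrated with a heavily simplified Python sketch for the sorting case. This is not the algorithm from the paper: it only learns bucket boundaries from training inputs and then locates each value by plain binary search, so it does not reach the optimal expected running time the abstract claims. The class and method names are hypothetical.

```python
import bisect
import random


class BucketedSorter:
    """Toy two-phase sorter: learn bucket boundaries, then bucket and sort."""

    def __init__(self):
        self.boundaries = []

    def train(self, samples):
        """Training phase: pool the sample inputs and keep evenly spaced values."""
        pooled = sorted(v for instance in samples for v in instance)
        step = max(1, len(samples))
        # With k samples of length n this keeps about n boundary values.
        self.boundaries = pooled[step - 1::step]

    def sort(self, instance):
        """Stationary regime: place each value in its bucket, sort each bucket."""
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for v in instance:
            # Bucket k holds values in (boundaries[k-1], boundaries[k]].
            buckets[bisect.bisect_left(self.boundaries, v)].append(v)
        out = []
        for bucket in buckets:
            bucket.sort()
            out.extend(bucket)
        return out


def draw_instance(rng, n):
    """Assumed product distribution: x_i is Gaussian with mean i, independently."""
    return [rng.gauss(i, 1.0) for i in range(n)]


rng = random.Random(0)
sorter = BucketedSorter()
sorter.train([draw_instance(rng, 8) for _ in range(20)])
instance = draw_instance(rng, 8)
result = sorter.sort(instance)
```

When the learned boundaries match the input distribution, most buckets hold very few values and the per-bucket sorts are cheap. The output is correctly sorted whatever the boundaries are, because the buckets are disjoint, ordered ranges.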

51 published item(s)

Counting hypertriangles through hypergraph orientations

Classic Graph Structural Features Outperform Factorization-Based Graph Embedding Methods on Community Labeling

Randomized Algorithms for Scientific Computing (RASC)

Distribution-Free Models of Social Networks

How the Degeneracy Helps for Triangle Counting in Graph Streams

How to Count Triangles, without Seeing the Whole Graph

Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS

The impossibility of low rank representations for triangle-rich complex networks

The Power of Pivoting for Exact Clique Counting

A $\widetilde{O}(n)$ Non-Adaptive Tester for Unateness

ESCAPE: Efficiently Counting All 5-Vertex Subgraphs

A simpler sublinear algorithm for approximating the triangle count

Approximately Counting Triangles in Sublinear Time

Avoiding the Global Sort: A Faster Contour Tree Algorithm

Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions

Trigger detection for adaptive scientific workflows using percentile sampling

A o(n) monotonicity tester for Boolean functions over the hypercube

Characterizing short-term stability for Boolean networks over any distribution of transfer functions

Counting Triangles in Real-World Graph Streams: Dealing with Repeated Edges and Time Windows

Decompositions of Triangle-Dense Graphs

Directed closure measures for networks with reciprocity

FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs

Optimal bounds for monotonicity and Lipschitz testing over hypercubes and hypergrids

Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

Property Testing on Product Distributions: Optimal Testers for Bounded Derivative Properties

Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

Why do simple algorithms for triangle enumeration work in the real world?

A Scalable Generative Graph Model with Community Structure

A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal

A space efficient streaming algorithm for triangle counting using the birthday paradox

An In-Depth Analysis of Stochastic Kronecker Graphs

An optimal lower bound for monotonicity testing over hypergrids

Counting Triangles in Massive Graphs with MapReduce

Estimating the longest increasing sequence in polylogarithmic time

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

A stopping criterion for Markov chains when generating independent random graphs

Are we there yet? When to stop a Markov chain while generating random graphs

Degree Relations of Triangles in Real-world Networks and Models

Finding Cycles and Trees in Sublinear Time

Self-improving Algorithms for Coordinate-wise Maxima

Triadic Measures on Graphs: The Power of Wedge Sampling

Community structure and scale-free collections of Erdös-Rényi graphs

Influence and Dynamic Behavior in Random Boolean Networks

Neighborhoods are good communities

The Similarity between Stochastic Kronecker and Chung-Lu Graph Models

Blackbox identity testing for bounded top fanin depth-3 circuits: the field doesn't matter

Combinatorial Approximation Algorithms for MaxCut using Random Walks

From Sylvester-Gallai Configurations to Rank Bounds: Improved Black-box Identity Test for Depth-3 Circuits

Is submodularity testable?

Self-Improving Algorithms