Researcher profile

Johan Ugander

Johan Ugander contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
8 works
0 followers
9 topics
4 close collaborators

Published work

8 published items

Preprint | 2026 | arXiv

Low-order outcomes and clustered designs: combining design and analysis for causal inference under network interference

Variance reduction for causal inference in the presence of network interference is often achieved through either outcome modeling, typically analyzed under unit-randomized Bernoulli designs, or clustered experimental designs, typically analyzed without strong parametric assumptions. In this work, we study the intersection of these two approaches and make the following threefold contributions. First, we present an estimator of the total treatment effect (or global average treatment effect) in low-order outcome models when the data are collected under general experimental designs, generalizing previous results for Bernoulli designs. We refer to this estimator as the pseudoinverse estimator and give bounds on its bias and variance in terms of properties of the experimental design. Second, we evaluate these bounds for the case of Bernoulli graph cluster randomized (GCR) designs. Under these designs, the estimator's variance scales like the smaller of the variance obtained by the estimator derived under a low-order assumption and the variance obtained from cluster randomization, showing that combining these variance reduction strategies is preferable to using either individually. When the order of the potential outcomes model is correctly specified, our estimator is always unbiased, and under a misspecified model, we upper bound the bias by the closeness of the ground truth model to a low-order model. Third, we give empirical evidence that our variance bounds can be used to select a good clustering that minimizes the worst-case variance under a cluster randomized design from a set of candidate clusterings. Across a range of graphs and clustering algorithms, our method consistently selects clusterings that perform well on a range of response models, suggesting the practical use of our bounds.

Preprint | 2020 | arXiv

Choosing to Grow a Graph: Modeling Network Formation as Discrete Choice

We provide a framework for modeling social network formation through conditional multinomial logit models from discrete choice and random utility theory, in which each new edge is viewed as a "choice" made by a node to connect to another node, based on (generic) features of the other nodes available to make a connection. This perspective on network formation unifies existing models such as preferential attachment, triadic closure, and node fitness, which are all special cases, and thereby provides a flexible means for conceptualizing, estimating, and comparing models. The lens of discrete choice theory also provides several new tools for analyzing social network formation; for example, the significance of node features can be evaluated in a statistically rigorous manner, and mixtures of existing models can be estimated by adapting known expectation-maximization algorithms. We demonstrate the flexibility of our framework through examples that analyze a number of synthetic and real-world datasets. For example, we provide rigorous methods for estimating preferential attachment models and show how to separate the effects of preferential attachment and triadic closure. Non-parametric estimates of the importance of degree show a highly linear trend, and we expose the importance of looking carefully at nodes with degree zero. Examining the formation of a large citation graph, we find evidence for an increased role of degree when accounting for age.
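As a rough illustration of the "edge as a choice" view described above, the sketch below computes conditional logit choice probabilities over a small set of candidate nodes. The function name, the three example degrees, and the choice of log-degree as the only feature are assumptions made for illustration and are not taken from the paper. With a coefficient of 1 on log-degree, the choice probabilities become proportional to degree, which is the preferential attachment special case the abstract mentions.

```python
import math

def choice_probabilities(features, theta):
    """Conditional logit: P(j) = exp(theta . x_j) / sum_k exp(theta . x_k)."""
    utilities = [sum(t * x for t, x in zip(theta, xj)) for xj in features]
    shift = max(utilities)  # subtract the max utility for numerical stability
    weights = [math.exp(u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidate nodes with degrees 1, 2 and 5 (illustrative values only).
degrees = [1, 2, 5]
# Single feature per candidate: log-degree.
features = [[math.log(d)] for d in degrees]
# theta = 1 on log-degree makes P(j) proportional to degree.
probs = choice_probabilities(features, theta=[1.0])
print(probs)
```

Adding further features (for example, an indicator for sharing a neighbor with the chooser) would extend the same function to other formation mechanisms; estimating theta from data is not shown here.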

Preprint | 2020 | arXiv

Discovering Context Effects from Raw Choice Data

Many applications in preference learning assume that decisions come from the maximization of a stable utility function. Yet a large experimental literature shows that individual choices and judgements can be affected by "irrelevant" aspects of the context in which they are made. An important class of such contexts is the composition of the choice set. In this work, our goal is to discover such choice set effects from raw choice data. We introduce an extension of the Multinomial Logit (MNL) model, called the context dependent random utility model (CDM), which allows for a particular class of choice set effects. We show that the CDM can be thought of as a second-order approximation to a general choice system, can be inferred optimally using maximum likelihood and, importantly, is easily interpretable. We apply the CDM to both real and simulated choice data to perform principled exploratory analyses for the presence of choice set effects.

Preprint | 2020 | arXiv

Evaluating stochastic seeding strategies in networks

When trying to maximize the adoption of a behavior in a population connected by a social network, it is common to strategize about where in the network to seed the behavior, often with an element of randomness. Selecting seeds uniformly at random is a basic but compelling strategy in that it distributes seeds broadly throughout the network. A more sophisticated stochastic strategy, one-hop targeting, is to select random network neighbors of random individuals; this exploits a version of the friendship paradox, whereby the friend of a random individual is expected to have more friends than a random individual, with the hope that seeding a behavior at more connected individuals leads to more adoption. Many seeding strategies have been proposed, but empirical evaluations have demanded large field experiments designed specifically for this purpose and have yielded relatively imprecise comparisons of strategies. Here we show how stochastic seeding strategies can be evaluated more efficiently in such experiments, how they can be evaluated "off-policy" using existing data arising from experiments designed for other purposes, and how to design more efficient experiments. In particular, we consider contrasts between stochastic seeding strategies and analyze nonparametric estimators adapted from policy evaluation and importance sampling. We use simulations on real networks to show that the proposed estimators and designs can increase precision while yielding valid inference. We then apply our proposed estimators to two field experiments, one that assigned households to an intensive marketing intervention and one that assigned students to an anti-bullying intervention.
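The friendship paradox behind one-hop targeting can be checked on a toy graph. The sketch below is an illustration only: the star graph and the function names are assumptions, and it compares expected degrees under the two seeding rules rather than implementing the paper's estimators. It assumes an undirected graph stored as an adjacency dictionary with no isolated nodes.

```python
def mean_degree(adj):
    """Expected degree of a uniformly random node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def mean_one_hop_degree(adj):
    """Expected degree of a uniformly random neighbor of a uniformly random node.

    Assumes every node has at least one neighbor.
    """
    total = 0.0
    for nbrs in adj.values():
        # Average degree among this node's neighbors.
        total += sum(len(adj[v]) for v in nbrs) / len(nbrs)
    return total / len(adj)

# Hypothetical star graph: hub 0 connected to leaves 1..4.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

print(mean_degree(adj))          # random seeding
print(mean_one_hop_degree(adj))  # one-hop targeting
```

On this star graph, one-hop targeting reaches the hub from every leaf, so the expected degree of a one-hop seed is higher than that of a uniformly random seed.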

Preprint | 2020 | arXiv

Fundamental Limits of Testing the Independence of Irrelevant Alternatives in Discrete Choice

The Multinomial Logit (MNL) model and the axiom it satisfies, the Independence of Irrelevant Alternatives (IIA), are together the most widely used tools of discrete choice. The MNL model serves as the workhorse model for a variety of fields, but is also widely criticized, with a large body of experimental literature claiming to document real-world settings where IIA fails to hold. Statistical tests of IIA as a modelling assumption have been the subject of many practical tests focusing on specific deviations from IIA over the past several decades, but the formal size properties of hypothesis testing IIA are still not well understood. In this work we replace some of the ambiguity in this literature with rigorous pessimism, demonstrating that any general test for IIA with low worst-case error would require a number of samples exponential in the number of alternatives of the choice problem. A major benefit of our analysis over previous work is that it lies entirely in the finite-sample domain, a feature crucial to understanding the behavior of tests in the common data-poor settings of discrete choice. Our lower bounds are structure-dependent, and as a potential cause for optimism, we find that if one restricts the test of IIA to violations that can occur in a specific collection of choice sets (e.g., pairs), one obtains structure-dependent lower bounds that are much less pessimistic. Our analysis of this testing problem is unorthodox in being highly combinatorial, counting Eulerian orientations of cycle decompositions of a particular bipartite graph constructed from a data set of choices. By identifying fundamental relationships between the comparison structure of a given testing problem and its sample efficiency, we hope these relationships will help lay the groundwork for a rigorous rethinking of the IIA testing problem as well as other testing problems in discrete choice.
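To make the IIA property concrete, the following sketch shows that under an MNL model the odds between two alternatives do not change when a third alternative is added to the choice set. The utilities and item names are hypothetical, and the example illustrates the axiom itself, not the testing procedure or lower bounds analyzed in the paper.

```python
import math

def mnl_probabilities(utilities, choice_set):
    """MNL choice probabilities restricted to the given choice set."""
    weights = {item: math.exp(utilities[item]) for item in choice_set}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

# Hypothetical utilities for three alternatives.
utilities = {"a": 1.0, "b": 0.0, "c": 2.0}

pair = mnl_probabilities(utilities, ["a", "b"])
triple = mnl_probabilities(utilities, ["a", "b", "c"])

# Under IIA the odds of "a" versus "b" are the same in both choice sets.
ratio_pair = pair["a"] / pair["b"]
ratio_triple = triple["a"] / triple["b"]
print(ratio_pair, ratio_triple)
```

Both ratios equal exp(1.0 - 0.0), so adding "c" rescales the probabilities of "a" and "b" without changing their odds. A violation of IIA would show up in data as these odds differing across choice sets.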

Preprint | 2020 | arXiv

Prioritized Restreaming Algorithms for Balanced Graph Partitioning

Balanced graph partitioning is a critical step for many large-scale distributed computations with relational data. As graph datasets have grown in size and density, a range of highly scalable balanced partitioning algorithms have appeared to meet varied demands across different domains. As the starting point for the present work, we observe that two recently introduced families of iterative partitioners, those based on restreaming and those based on balanced label propagation (including Facebook's Social Hash Partitioner), can be viewed through a common modular framework of design decisions. With the help of this modular perspective, we find that a key combination of design decisions leads to a novel family of algorithms with notably better empirical performance than any existing highly scalable algorithm on a broad range of real-world graphs. The resulting prioritized restreaming algorithms employ a constraint management strategy based on multiplicative weights, borrowed from the restreaming literature, while adopting notions of priority from balanced label propagation to optimize the ordering of the streaming process. Our experimental results consider a range of stream orders, where a dynamic ordering based on what we call ambivalence is broadly the best performing in terms of the cut quality of the resulting balanced partitions, with a static ordering based on degree being nearly as good.

Preprint | 2020 | arXiv

Randomized Graph Cluster Randomization

The global average treatment effect (GATE) is a primary quantity of interest in the study of causal inference under network interference. With a correctly specified exposure model of the interference, the Horvitz-Thompson (HT) and Hájek estimators of the GATE are unbiased and consistent, respectively, yet known to exhibit extreme variance under many designs and in many settings of interest. With a fixed clustering of the interference graph, graph cluster randomization (GCR) designs have been shown to greatly reduce variance compared to node-level random assignment, but even so the variance is still often prohibitively large. In this work we propose a randomized version of the GCR design, descriptively named randomized graph cluster randomization (RGCR), which uses a random clustering rather than a single fixed clustering. By considering an ensemble of many different cluster assignments, this design avoids a key problem with GCR where a given node is sometimes "lucky" or "unlucky" in a given clustering. We propose two randomized graph decomposition algorithms for use with RGCR, randomized 3-net and 1-hop-max, adapted from prior work on multiway graph cut problems. When integrating over their own randomness, these algorithms furnish network exposure probabilities that can be estimated efficiently. We develop upper bounds on the variance of the HT estimator of the GATE under assumptions on the metric structure of the interference graph. Where the best known variance upper bound for the HT estimator under a GCR design is exponential in the parameters of the metric structure, we give a comparable variance upper bound under RGCR that is instead polynomial in the same parameters. We provide extensive simulations comparing RGCR and GCR designs, observing substantial reductions in the mean squared error for both HT and Hájek estimators of the GATE in a variety of settings.
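For orientation, here is a minimal sketch of the Horvitz-Thompson estimator of the GATE under a plain GCR design with a single fixed clustering, which is the baseline that RGCR modifies. It assumes a full-neighborhood exposure model (a node counts as exposed to treatment only if it and all of its neighbors are treated, and likewise for control) and independent Bernoulli(p) assignment of clusters. The graph, clustering, outcomes, and function names are hypothetical. Under RGCR the exposure probabilities would instead be integrated over the random clustering, which this sketch does not do.

```python
def ht_gate_estimate(adj, cluster_of, cluster_treated, outcomes, p=0.5):
    """Horvitz-Thompson GATE estimate under full-neighborhood exposure.

    Clusters are assumed to be assigned to treatment independently with
    probability p, so a node whose closed neighborhood touches k clusters is
    fully treated with probability p**k and fully control with (1 - p)**k.
    """
    n = len(adj)
    estimate = 0.0
    for i in adj:
        hood = [i] + list(adj[i])  # closed neighborhood of node i
        k = len({cluster_of[v] for v in hood})
        treated = [cluster_treated[cluster_of[v]] for v in hood]
        if all(treated):
            estimate += outcomes[i] / p ** k
        elif not any(treated):
            estimate -= outcomes[i] / (1 - p) ** k
        # Nodes with mixed neighborhoods contribute nothing.
    return estimate / n

# Hypothetical path graph 0 - 1 - 2 - 3 split into two clusters.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
cluster_treated = {"A": True, "B": False}  # one realized assignment
outcomes = {0: 3.0, 1: 2.0, 2: 1.0, 3: 1.0}  # illustrative observed outcomes

estimate = ht_gate_estimate(adj, cluster_of, cluster_treated, outcomes, p=0.5)
print(estimate)
```

Nodes 1 and 2 sit on the cluster boundary, so their neighborhoods are mixed and they drop out of the estimate. The inverse-probability weights grow quickly with the number of clusters a neighborhood touches, which is one way to see why the variance can be large under a fixed clustering.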

Preprint | 2020 | arXiv

Scaling Choice Models of Relational Social Data

Many prediction problems on social networks, from recommendations to anomaly detection, can be approached by modeling network data as a sequence of relational events and then leveraging the resulting model for prediction. Conditional logit models of discrete choice are a natural approach to modeling relational events as "choices" in a framework that envelops and extends many long-studied models of network formation. The conditional logit model is simplistic, but it is particularly attractive because it allows for efficient consistent likelihood maximization via negative sampling, something that isn't true for mixed logit and many other richer models. The value of negative sampling is particularly pronounced because choice sets in relational data are often enormous. Given the importance of negative sampling, in this work we introduce a model simplification technique for mixed logit models that we call "de-mixing", whereby standard mixture models of network formation (particularly models that mix local and global link formation) are reformulated to operate their modes over disjoint choice sets. This reformulation reduces mixed logit models to conditional logit models, opening the door to negative sampling while also circumventing other standard challenges with maximizing mixture model likelihoods. To further improve scalability, we also study importance sampling for more efficiently selecting negative samples, finding that it can greatly speed up inference in both standard and de-mixed models. Together, these steps make it possible to much more realistically model network formation in very large graphs. We illustrate the relative gains of our improvements on synthetic datasets with known ground truth as well as a large-scale dataset of public transactions on the Venmo platform.