Source author record

Peter J. Bickel

Peter J. Bickel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: math.ST, Statistics Theory, Methodology, Machine Learning, Applications, Social and Information Networks, math.FA, physics.soc-ph, q-fin.ST

Catalog footprint

What is connected

19 works

9 topics

4 close collaborators

preprint, 2022, arXiv

Measures of independence and functional dependence

We follow up on Shi et al.'s (2020) and Cao's and my (2020) work on the local power of a new test for independence, Chatterjee (2019), and its relation to the local power properties of classical tests. We show quite generally that for testing independence with local alternatives either Chatterjee's rank test has no power, or it may be misleading: the Blum, Kiefer, Rosenblatt, and other omnibus classical rank tests do have some local power in any direction other than those where significant results may be misleading. We also suggest methods of selective inference in independence testing. Chatterjee's statistic, like Rényi's (1959), also identifies functional dependence. We exhibit statistics which have better power properties than Chatterjee's but also identify functional dependence.
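For orientation, the rank statistic from Chatterjee (2019) that this abstract discusses can be sketched in a few lines. This is a minimal illustration, assuming no ties in either variable (the tie-corrected form is omitted); the function name `chatterjee_xi` is hypothetical, and the code is not the test or the improved statistics proposed in the paper.

```python
def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n for paired samples without ties.

    Sort the pairs by x, take the ranks r_1..r_n of the reordered y values,
    and return 1 - 3 * sum(|r_{i+1} - r_i|) / (n^2 - 1).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this sketch assumes no ties in x or y")
    # Indices that put the x values in increasing order.
    order = sorted(range(n), key=lambda i: x[i])
    # Rank (1..n) of each y value among all y values.
    rank_of = {value: rank for rank, value in enumerate(sorted(y), start=1)}
    ranks = [rank_of[y[i]] for i in order]
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)
```

When y is a strictly monotone function of x, the consecutive rank differences are all 1, so the value is 1 - 3/(n + 1), which approaches 1 as n grows. Unlike Pearson or Spearman correlation, the statistic is not symmetric in x and y.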

preprint, 2020, arXiv

An Assumption-Free Exact Test For Fixed-Design Linear Models With Exchangeable Errors

We propose the Cyclic Permutation Test (CPT) to test general linear hypotheses for linear models. This test is non-randomized and valid in finite samples with exact Type I error $α$ for an arbitrary fixed design matrix and arbitrary exchangeable errors, whenever $1/α$ is an integer and $n/p \ge 1/α - 1$. The test involves applying the marginal rank test to $1/α$ linear statistics of the outcome vector, where the coefficient vectors are determined by solving a linear system such that the joint distribution of the linear statistics is invariant with respect to a non-standard cyclic permutation group under the null hypothesis. The power can be further enhanced by solving a secondary non-linear travelling salesman problem, for which the genetic algorithm can find a reasonably good solution. Extensive simulation studies show that the CPT has comparable power to existing tests. When testing for a single contrast of coefficients, an exact confidence interval can be obtained by inverting the test. Furthermore, we provide a selective yet extensive literature review of the century-long efforts on this problem, highlighting the novelty of our test.

preprint, 2020, arXiv

Generalized Pearson correlation squares for capturing mixtures of bivariate linear dependences

Motivated by the pressing needs for capturing complex but interpretable variable relationships in scientific research, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued random variables, with or without an index variable that specifies the line memberships. We construct generalized Pearson correlation squares by focusing on three aspects: the exchangeability of the two variables, the independence of parametric model assumptions, and the availability of population-level parameters. For the computation of the generalized Pearson correlation square from a sample without line-membership specification, we develop a K-lines clustering algorithm, where K, the number of lines, can be chosen in a data-adaptive way. With our defined population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and compare the generalized Pearson correlation squares with other widely-used association measures in terms of power. Gene expression data analysis demonstrates the effectiveness of the generalized Pearson correlation squares in capturing interpretable gene-gene relationships missed by other measures. We implement the estimation and inference procedures in an R package gR2.

preprint, 2020, arXiv

Hierarchical community detection by recursive partitioning

The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We apply the algorithm to a gene network based on gene co-occurrence in 1580 research papers on anemia, and identify six clusters of genes in a meaningful hierarchy. We also illustrate the algorithm on a dataset of statistics papers.
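The top-down procedure described above (split in two by spectral clustering, recurse, stop when a rule says so) can be sketched as follows. This is only an illustration under stated assumptions: the split uses the sign of an approximate Fiedler vector of the graph Laplacian, computed by shifted power iteration, and the stopping rule (a cut-to-volume threshold `max_ratio` plus a minimum size `min_size`) is a simple stand-in, not the stopping rule analysed in the paper. All function and parameter names are hypothetical.

```python
def fiedler_vector(adj, iters=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg) + 1.0  # shift so that c*I - L is positive definite
    # Deterministic start vector, already orthogonal to the constant vector.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iters):
        # w = (c*I - L) v with L = D - A.
        w = [(c - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # project out the constant eigenvector
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def recursive_bipartition(nodes, edges, max_ratio=0.3, min_size=2):
    """Return a community tree: a sorted list for a leaf, a pair for a split."""
    nodes = sorted(nodes)
    if len(nodes) < 2 * min_size:
        return nodes
    index = {u: k for k, u in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        if u in index and v in index:
            adj[index[u]][index[v]] = adj[index[v]][index[u]] = 1
    vec = fiedler_vector(adj)
    left = [u for u in nodes if vec[index[u]] < 0]
    right = [u for u in nodes if vec[index[u]] >= 0]
    if not left or not right:
        return nodes
    # Stand-in stopping rule: stop when the cut is large relative to the
    # smaller side's total degree (volume) within this subgraph.
    cut = sum(adj[index[u]][index[v]] for u in left for v in right)
    vol = min(sum(sum(adj[index[u]]) for u in side) for side in (left, right))
    if vol == 0 or cut / vol > max_ratio:
        return nodes
    return (recursive_bipartition(left, edges, max_ratio, min_size),
            recursive_bipartition(right, edges, max_ratio, min_size))
```

On two 4-node cliques joined by a single edge, the first split separates the cliques (cut 1 against volume 13), and each clique is then kept whole because any further split would cut too many edges.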

preprint, 2016, arXiv

Asymptotics For High Dimensional Regression M-Estimates: Fixed Design Results

We investigate the asymptotic distributions of coordinates of regression M-estimates in the moderate $p/n$ regime, where the number of covariates $p$ grows proportionally with the sample size $n$. Under appropriate regularity conditions, we establish the coordinate-wise asymptotic normality of regression M-estimates assuming a fixed-design matrix. Our proof is based on the second-order Poincaré inequality (Chatterjee, 2009) and leave-one-out analysis (El Karoui et al., 2011). Some relevant examples are indicated to show that our regularity conditions are satisfied by a broad class of design matrices. We also show a counterexample, namely the ANOVA-type design, to emphasize that the technical assumptions are not just artifacts of the proof. Finally, the numerical experiments confirm and complement our theoretical results.

preprint, 2016, arXiv

Likelihood-based model selection for stochastic block models

The stochastic block model (SBM) provides a popular framework for modeling community structures in networks. However, more attention has been devoted to problems concerning estimating the latent node labels and the model parameters than the issue of choosing the number of blocks. We consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification. We show the limiting distribution of the statistic in the case of underfitting is normal and obtain its convergence rate in the case of overfitting. These conclusions remain valid when the average degree grows at a polylog rate. The results enable us to derive the correct order of the penalty term for model complexity and arrive at a likelihood-based model selection criterion that is asymptotically consistent. Our analysis can also be extended to a degree-corrected block model (DCSBM). In practice, the likelihood function can be estimated using more computationally efficient variational methods or consistent label estimation algorithms, allowing the criterion to be applied to large networks.

preprint, 2015, arXiv

Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis

Networks pervade many disciplines of science for analyzing complex systems with interacting components. In particular, this concept is commonly used to model interactions between genes and identify closely associated genes forming functional modules. In this paper, we focus on gene group interactions and infer these interactions using appropriate partial correlations between genes, that is, the conditional dependencies between genes after removing the influences of a set of other functionally related genes. We introduce a new method for estimating group interactions using sparse canonical correlation analysis (SCCA) coupled with repeated random partition and subsampling of the gene expression data set. By considering different subsets of genes and ways of grouping them, our interaction measure can be viewed as an aggregated estimate of partial correlations of different orders. Our approach is unique in evaluating conditional dependencies when the correct dependent sets are unknown or only partially known. As a result, a gene network can be constructed using the interaction measures as edge weights and gene functional groups can be inferred as tightly connected communities from the network. Comparisons with several popular approaches using simulated and real data show our procedure improves both the statistical significance and biological interpretability of the results. In addition to achieving considerably lower false positive rates, our procedure shows better performance in detecting important biological pathways.

preprint, 2015, arXiv

Spectral Clustering and Block Models: A Review And A New Algorithm

We focus on spectral clustering of unlabeled graphs and review some results on clustering methods which achieve weak or strong consistent identification in data generated by such models. We also present a new algorithm which appears to perform optimally both theoretically using asymptotic theory and empirically.

preprint, 2015, arXiv

Subsampling bootstrap of count features of networks

Analysis of stochastic models of networks is quite important in light of the huge influx of network data in social, information and bio sciences, but a proper statistical analysis of features of different stochastic models of networks is still underway. We propose bootstrap subsampling methods for finding the empirical distribution of count features or "moments" (Bickel, Chen and Levina [Ann. Statist. 39 (2011) 2280-2301]) and smooth functions of these features for the networks. Using these methods, we can not only estimate the variance of count features but also get good estimates of such feature counts, which are usually expensive to compute numerically in large networks. In our paper, we prove theoretical properties of the bootstrap estimates of variance of the count features as well as show their efficacy through simulation. We also use the method on some real network data for estimation of variance and expectation of some count features.
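To make the idea concrete, here is a minimal sketch of node subsampling for the simplest count feature, the edge density. It assumes uniform sampling of `m` nodes without replacement and uses the induced subgraph; the paper's actual sampling schemes and normalizations are not reproduced, and the function name and parameters are hypothetical. The spread of the subsample values would still need rescaling before it could serve as a variance estimate for the full-graph statistic.

```python
import random
from itertools import combinations


def subsample_edge_density(nodes, edges, m, reps, seed=0):
    """Mean and sample variance of edge density over random induced subgraphs."""
    if m < 2 or m > len(nodes) or reps < 2:
        raise ValueError("need 2 <= m <= len(nodes) and reps >= 2")
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    pairs_per_subgraph = m * (m - 1) / 2
    values = []
    for _ in range(reps):
        sub = rng.sample(nodes, m)  # uniform node subsample, no replacement
        count = sum(1 for pair in combinations(sub, 2)
                    if frozenset(pair) in edge_set)
        values.append(count / pairs_per_subgraph)
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, var
```

Each replicate inspects only m(m-1)/2 node pairs, which is the source of the computational saving on large networks; richer count features such as triangles would follow the same pattern with a different inner count.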

preprint, 2014, arXiv

Community Detection in Networks using Graph Distance

The study of networks has received increased attention recently not only from the social sciences and statistics but also from physicists, computer scientists and mathematicians. One of the principal problems in networks is community detection. Many algorithms have been proposed for community finding but most of them do not have theoretical guarantees for sparse networks and networks close to the phase transition boundary proposed by physicists. There are some exceptions but all have some incomplete theoretical basis. Here we propose an algorithm based on the graph distance of vertices in the network. We give theoretical guarantees that our method works in identifying communities for block models and can be extended for degree-corrected block models and block models with the number of communities growing with the number of vertices. Despite favorable simulation results, we are not yet able to conclude that our method is satisfactory for the worst possible case. We illustrate on a network of political blogs, Facebook networks and some other networks.

preprint, 2013, arXiv

Pseudo-likelihood methods for community detection in large sparse networks

Many algorithms have been proposed for fitting network models with communities, but most of them do not scale well to large networks, and often fail on sparse networks. Here we propose a new fast pseudo-likelihood method for fitting the stochastic block model for networks, as well as a variant that allows for an arbitrary degree distribution by conditioning on degrees. We show that the algorithms perform well under a range of settings, including on very sparse networks, and illustrate on the example of a network of political blogs. We also propose spectral clustering with perturbations, a method of independent interest, which works well on sparse networks where regular spectral clustering fails, and use it to provide an initial value for pseudo-likelihood. We prove that pseudo-likelihood provides consistent estimates of the communities under a mild condition on the starting value, for the case of a block model with two communities.

preprint, 2012, arXiv

The method of moments and degree distributions for network models

Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree including $Ω(1)$. Additional results are obtained for the important special case of degree distributions.

preprint, 2011, arXiv

Large Vector Auto Regressions

One popular approach for nonstructural economic and financial forecasting is to include a large number of economic and financial variables, which has been shown to lead to significant improvements for forecasting, for example, by the dynamic factor models. A challenging issue is to determine which variables and (their) lags are relevant, especially when there is a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure and moderate sample size (relative to dimensionality and lags). To this end, an integrated solution that addresses these three challenges simultaneously is appealing. We study the large vector auto regressions here with three types of estimates. We treat each variable's own lags differently from other variables' lags, distinguish various lags over time, and are able to select the variables and lags simultaneously. We first show the consequences of using Lasso-type estimates directly for time series without considering the temporal dependence. In contrast, our proposed method can still produce an estimate as efficient as an oracle under such scenarios. The tuning parameters are chosen via a data-driven "rolling scheme" method to optimize the forecasting performance. A macroeconomic and financial forecasting problem is considered to illustrate its superiority over existing estimators.

preprint, 2011, arXiv

Leo Breiman: An important intellectual and personal force in statistics, my life and that of many others

I first met Leo Breiman in 1979 at the beginning of his third career, Professor of Statistics at Berkeley. He obtained his PhD with Loève at Berkeley in 1957. His first career was as a probabilist in the Mathematics Department at UCLA. After distinguished research, including the Shannon-Breiman-McMillan Theorem and getting tenure, he decided that his real interest was in applied statistics, so he resigned his position at UCLA and set up as a consultant. Before doing so he produced two classic texts, Probability, now reprinted as a SIAM Classic in Applied Mathematics, and Statistics. Both books reflected his strong opinion that intuition and rigor must be combined. He expressed this in his probability book which he viewed as a combination of his learning the right hand of probability, rigor, from Loève, and the left hand, intuition, from David Blackwell.

preprint, 2011, arXiv

Measuring reproducibility of high-throughput experiments

Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the "irreproducible discovery rate" (IDR) analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.

preprint, 2011, arXiv

Subsampling Methods for genomic inference

Large-scale statistical analysis of data sets associated with genome sequences plays an important role in modern biology. A key component of such statistical analyses is the computation of $p$-values and confidence bounds for statistics defined on the genome. Currently such computation is commonly achieved through ad hoc simulation measures. The method of randomization, which is at the heart of these simulation procedures, can significantly affect the resulting statistical conclusions. Most simulation schemes introduce a variety of hidden assumptions regarding the nature of the randomness in the data, resulting in a failure to capture biologically meaningful relationships. To address the need for a method of assessing the significance of observations within large scale genomic studies, where there often exists a complex dependency structure between observations, we propose a unified solution built upon a data subsampling approach. We propose a piecewise stationary model for genome sequences and show that the subsampling approach gives correct answers under this model. We illustrate the method on three simulation studies and two real data examples.

preprint, 2010, arXiv

Approximating the inverse of banded matrices by banded matrices with applications to probability and statistics

In the first part of this paper we give an elementary proof of the fact that if an infinite matrix $A$, which is invertible as a bounded operator on $\ell^2$, can be uniformly approximated by banded matrices then so can the inverse of $A$. We give explicit formulas for the banded approximations of $A^{-1}$ as well as bounds on their accuracy and speed of convergence in terms of their band-width. In the second part we apply these results to covariance matrices $Σ$ of Gaussian processes and study mixing and beta mixing of processes in terms of properties of $Σ$. Finally, we note some applications of our results to statistics.

preprint, 2010, arXiv

Discussion of: Brownian distance covariance

Discussion on "Brownian distance covariance" by Gábor J. Székely and Maria L. Rizzo [arXiv:1010.0297]

preprint, 2010, arXiv

Simultaneous analysis of Lasso and Dantzig selector

We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1\le p\le 2$ in the linear model when the number of variables can be much larger than the sample size.
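For readers unfamiliar with the two estimators, their standard definitions in the linear model $y = Xβ + ε$ with $n$ observations can be written as below. This is a sketch under one common normalization, with a tuning parameter $r > 0$ introduced here for illustration; the paper's exact scaling and assumptions may differ.

$$
\hat{β}_L = \arg\min_{β} \left\{ \frac{1}{n} \|y - Xβ\|_2^2 + 2r \|β\|_1 \right\},
\qquad
\hat{β}_D = \arg\min_{β} \left\{ \|β\|_1 : \left\| \frac{1}{n} X^{\top} (y - Xβ) \right\|_\infty \le r \right\}.
$$

Under this normalization, the first-order optimality conditions for the Lasso give $\left\| \frac{1}{n} X^{\top} (y - X\hat{β}_L) \right\|_\infty \le r$, so the Lasso solution is always feasible for the Dantzig constraint. This shared constraint is one way to see why the two estimators can be analysed in parallel.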
