Source author record

Art B. Owen

Art B. Owen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.NA, Computation, Methodology, Numerical Analysis, math.ST, Statistics Theory, Applications, Machine Learning, Computational Complexity, econ.EM, Quantitative Methods, Social and Information Networks

Catalog footprint

29 works

12 topics

4 close collaborators


preprint, 2026, arXiv

Quasi-Monte Carlo with one categorical variable

We study randomized quasi-Monte Carlo (RQMC) estimation of a multivariate integral where one of the variables takes only a finite number of values. This problem arises when the variable of integration is drawn from a mixture distribution as is common in importance sampling and also arises in some recent work on transport maps. We find that when the integration error decreases at an RQMC rate, it is important to oversample the smallest mixture components instead of using a proportional allocation. This can even improve the rate of convergence. The optimal allocations depend on the possibly unknown convergence rate. Designing the sample with an incorrect assumption on the rate still attains that convergence rate, with an inferior implied constant. The penalty for using a pessimistic rate is typically higher than for using an optimistic one. We also find that for the most accurate RQMC sampling methods, it is advantageous to arrange that our $n=2^m$ randomized Sobol' points split into subsample sizes that are also powers of $2$.
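The effect of the convergence rate on the allocation can be illustrated with a minimal sketch. It assumes an error model that is not taken from the abstract: the estimate for component j, with mixture weight w_j and n_j points, has an independent error with root mean squared error c * n_j**(-rho) for a common constant c. Minimizing the total variance, the sum of w_j**2 * c**2 * n_j**(-2*rho), subject to a fixed total n gives n_j proportional to w_j**(2/(2*rho+1)). The function name `allocation` and the example weights are hypothetical, and the sketch returns real-valued sizes without the power-of-2 rounding mentioned above.

```python
def allocation(weights, rho, n):
    """Real-valued sample sizes n_j proportional to w_j ** (2 / (2 * rho + 1)).

    Assumed error model (illustrative only): component j contributes
    variance w_j**2 * c**2 * n_j**(-2 * rho) with a common constant c.
    """
    exponent = 2.0 / (2.0 * rho + 1.0)
    scores = [w ** exponent for w in weights]
    total = sum(scores)
    return [n * s / total for s in scores]


weights = [0.90, 0.09, 0.01]
# rho = 1/2 (plain Monte Carlo rate): exponent 1, proportional allocation.
mc_like = allocation(weights, rho=0.5, n=1024)
# rho = 3/2 (an assumed RQMC-like rate): exponent 1/2, small components get more.
rqmc_like = allocation(weights, rho=1.5, n=1024)
```

Under this assumed model the smallest component receives a much larger share at the faster rate than its weight alone would give it.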

preprint, 2024, arXiv

Computable error bounds for quasi-Monte Carlo using points with non-negative local discrepancy

Let $f:[0,1]^d\to\mathbb{R}$ be a completely monotone integrand as defined by Aistleitner and Dick (2015) and let points $\boldsymbol{x}_0,\dots,\boldsymbol{x}_{n-1}\in[0,1]^d$ have a non-negative local discrepancy (NNLD) everywhere in $[0,1]^d$. We show how to use these properties to get a non-asymptotic and computable upper bound for the integral of $f$ over $[0,1]^d$. An analogous non-positive local discrepancy (NPLD) property provides a computable lower bound. It has been known since Gabai (1967) that the two dimensional Hammersley points in any base $b\ge2$ have non-negative local discrepancy. Using the probabilistic notion of associated random variables, we generalize Gabai's finding to digital nets in any base $b\ge2$ and any dimension $d\ge1$ when the generator matrices are permutation matrices. We show that permutation matrices cannot attain the best values of the digital net quality parameter when $d\ge3$. As a consequence the computable absolutely sure bounds we provide come with less accurate estimates than the usual digital net estimates do in high dimensions. We are also able to construct high dimensional rank one lattice rules that are NNLD. We show that those lattices do not have good discrepancy properties: any lattice rule with the NNLD property in dimension $d\ge2$ either fails to be projection regular or has all its points on the main diagonal. Complete monotonicity is a very strict requirement that for some integrands can be mitigated via a control variate.

preprint, 2022, arXiv

Pre-integration via Active Subspaces

Pre-integration is an extension of conditional Monte Carlo to quasi-Monte Carlo and randomized quasi-Monte Carlo. It can reduce but not increase the variance in Monte Carlo. For quasi-Monte Carlo it can bring about improved regularity of the integrand with potentially greatly improved accuracy. Pre-integration is ordinarily done by integrating out one of $d$ input variables to a function. In the common case of a Gaussian integral one can also pre-integrate over any linear combination of variables. We propose to do that and we choose the first eigenvector in an active subspace decomposition to be the pre-integrated linear combination. We find in numerical examples that this active subspace pre-integration strategy is competitive with pre-integrating the first variable in the principal components construction on the Asian option where principal components are known to be very effective. It outperforms other pre-integration methods on some basket options where there is no well established default. We show theoretically that, just as in Monte Carlo, pre-integration can reduce but not increase the variance when one uses scrambled net integration. We show that the lead eigenvector in an active subspace decomposition is closely related to the vector that maximizes a less computationally tractable criterion using a Sobol' index to find the most important linear combination of Gaussian variables. They optimize similar expectations involving the gradient. We show that the Sobol' index criterion for the leading eigenvector is invariant to the way that one chooses the remaining $d-1$ eigenvectors with which to sample the Gaussian vector.
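As a toy illustration of why pre-integration cannot increase Monte Carlo variance and can smooth an integrand, the sketch below integrates one Gaussian coordinate out of a discontinuous indicator in closed form. The integrand, the function names and the sample size are assumptions chosen for simplicity. It pre-integrates a single coordinate and does not implement the active subspace choice of direction described above.

```python
import math
import random
import statistics


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def raw_integrand(z1, z2):
    """Discontinuous toy integrand: indicator that z1 + z2 > 0."""
    return 1.0 if z1 + z2 > 0 else 0.0


def preintegrated(z2):
    """E[raw_integrand(Z1, z2)] for Z1 ~ N(0, 1), i.e. P(Z1 > -z2) = Phi(z2)."""
    return normal_cdf(z2)


rng = random.Random(1)
draws = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4000)]
raw_values = [raw_integrand(z1, z2) for z1, z2 in draws]
pre_values = [preintegrated(z2) for _, z2 in draws]

raw_var = statistics.variance(raw_values)  # population value is 1/4
pre_var = statistics.variance(pre_values)  # population value is 1/12
```

The pre-integrated function is smooth in z2, which is the kind of regularity gain that matters for quasi-Monte Carlo.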

preprint, 2022, arXiv

Super-polynomial accuracy of multidimensional randomized nets using the median-of-means

We study approximate integration of a function $f$ over $[0,1]^s$ based on taking the median of $2r-1$ integral estimates derived from independently randomized $(t,m,s)$-nets in base $2$. The nets are randomized by Matousek's random linear scramble with a digital shift. If $f$ is analytic over $[0,1]^s$, then the probability that any one randomized net's estimate has an error larger than $2^{-cm^2/s}$ times a quantity depending on $f$ is $O(1/\sqrt{m})$ for any $c<3\log(2)/π^2\approx 0.21$. As a result the median of the distribution of these scrambled nets has an error that is $O(n^{-c\log(n)/s})$ for $n=2^m$ function evaluations. The sample median of $2r-1$ independent draws attains this rate too, so long as $r/m^2$ is bounded away from zero as $m\to\infty$. We include results for finite precision estimates and some non-asymptotic comparisons to taking the mean of $2r-1$ independent draws.

preprint, 2022, arXiv

Super-polynomial accuracy of one dimensional randomized nets using the median-of-means

Let $f$ be analytic on $[0,1]$ with $|f^{(k)}(1/2)|\leq Aα^kk!$ for some constant $A$ and $α<2$. We show that the median estimate of $μ=\int_0^1f(x)\,\mathrm{d}x$ under random linear scrambling with $n=2^m$ points converges at the rate $O(n^{-c\log(n)})$ for any $c< 3\log(2)/π^2\approx 0.21$. We also get a super-polynomial convergence rate for the sample median of $2k-1$ random linearly scrambled estimates, when $k=Ω(m)$. When $f$ has a $p$'th derivative that satisfies a $λ$-Hölder condition then the median-of-means has error $O( n^{-(p+λ)+ε})$ for any $ε>0$, if $k\to\infty$ as $m\to\infty$.
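The median of several independently scrambled estimates can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions: the random linear scramble is represented as a lower-triangular bit matrix with ones on the diagonal, a digital shift is added, and digits are truncated at 32 bits. The function names, the precision and the test integrand are hypothetical choices, not details from the abstract.

```python
import math
import random
import statistics


def scrambled_points(m, rng, precision=32):
    """n = 2**m base-2 van der Corput points after a random linear scramble
    (lower-triangular bit matrix, unit diagonal) and a digital shift."""
    rows = []
    for k in range(precision):
        free = min(k, m)                        # random entries left of the diagonal
        row = rng.getrandbits(free) if free else 0
        if k < m:
            row |= 1 << k                       # unit diagonal entry
        rows.append(row)
    shift = [rng.getrandbits(1) for _ in range(precision)]
    points = []
    for i in range(1 << m):                     # bit j of i is the j-th input digit
        x = 0.0
        for k, row in enumerate(rows):
            bit = (bin(row & i).count("1") + shift[k]) & 1
            x += bit * 2.0 ** -(k + 1)
        points.append(x)
    return points


def median_estimate(f, m, r, seed=0):
    """Median of 2r - 1 independently scrambled estimates of the integral of f."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(2 * r - 1):
        pts = scrambled_points(m, rng)
        estimates.append(sum(f(x) for x in pts) / len(pts))
    return statistics.median(estimates)


estimate = median_estimate(math.exp, m=6, r=4)  # target is e - 1
```

Each scrambled point set still places exactly one point in every interval of width 2**-m, which is what keeps every individual estimate reasonable before the median is taken.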

preprint, 2022, arXiv

Where are the logs?

The commonly quoted error rate for QMC integration with an infinite low discrepancy sequence is $O(n^{-1}\log(n)^r)$ with $r=d$ for extensible sequences and $r=d-1$ otherwise. Such rates hold uniformly over all $d$ dimensional integrands of Hardy-Krause variation one when using $n$ evaluation points. Implicit in those bounds is that for any sequence of QMC points, the integrand can be chosen to depend on $n$. In this paper we show that rates with any $r<(d-1)/2$ can hold when $f$ is held fixed as $n\to\infty$. This is accomplished following a suggestion of Erich Novak to use some unpublished results of Trojan from the 1980s as given in the information based complexity monograph of Traub, Wasilkowski and Woźniakowski. The proof is made by applying a technique of Roth with the theorem of Trojan. The proof is non constructive and we do not know of any integrand of bounded variation in the sense of Hardy and Krause for which the QMC error exceeds $(\log n)^{1+ε}/n$ for infinitely many $n$ when using a digital sequence such as one of Sobol's. An empirical search when $d=2$ for integrands designed to exploit known weaknesses in certain point sets showed no evidence that $r>1$ is needed. An example with $d=3$ and $n$ up to $2^{100}$ might possibly require $r>1$.

preprint, 2021, arXiv

Combining randomized field experiments with observational satellite data to assess the benefits of crop rotations on yields

With climate change threatening agricultural productivity and global food demand increasing, it is important to better understand which farm management practices will maximize crop yields in various climatic conditions. To assess the effectiveness of agricultural practices, researchers often turn to randomized field experiments, which are reliable for identifying causal effects but are often limited in scope and therefore lack external validity. Recently, researchers have also leveraged large observational datasets from satellites and other sources, which can lead to conclusions biased by confounding variables or systematic measurement errors. Because experimental and observational datasets have complementary strengths, in this paper we propose a method that uses a combination of experimental and observational data in the same analysis. As a case study, we focus on the causal effect of crop rotation on corn (maize) and soy yields in the Midwestern United States. We find that, in terms of root mean squared error, our hybrid method performs 13% better than using experimental data alone and 26% better than using the observational data alone in the task of predicting the effect of rotation on corn yield at held-out experimental sites. Further, the causal estimates based on our method suggest that benefits of crop rotations on corn yield are lower in years and locations with high temperatures whereas the benefits of crop rotations on soy yield are higher in years and locations with high temperatures. In particular, we estimated that the benefit of rotation on corn yields (and soy yields) was 0.84 t/ha (0.23 t/ha) on average for the top quintile of temperatures, 1.02 t/ha (0.20 t/ha) on average for the whole dataset, and 1.18 t/ha (0.15 t/ha) on average for the bottom quintile of temperatures.

preprint, 2020, arXiv

A strong law of large numbers for scrambled net integration

This article provides a strong law of large numbers for integration on digital nets randomized by a nested uniform scramble. The motivating problem is optimization over some variables of an integral over others, arising in Bayesian optimization. This strong law requires that the integrand have a finite moment of order $p$ for some $p>1$. Previously known results implied a strong law only for Riemann integrable functions. Previous general weak laws of large numbers for scrambled nets require a square integrable integrand. We generalize from $L^2$ to $L^p$ for $p>1$ via the Riesz-Thorin interpolation theorem.

preprint, 2020, arXiv

Efficient estimation of the ANOVA mean dimension, with an application to neural net classification

The mean dimension of a black box function of $d$ variables is a convenient way to summarize the extent to which it is dominated by high or low order interactions. It is expressed in terms of $2^d-1$ variance components but it can be written as the sum of $d$ Sobol' indices that can be estimated by leave one out methods. We compare the variance of these leave one out methods: a Gibbs sampler called winding stairs, a radial sampler that changes each variable one at a time from a baseline, and a naive sampler that never reuses function evaluations and so costs about double the other methods. For an additive function the radial and winding stairs are most efficient. For a multiplicative function the naive method can easily be most efficient if the factors have high kurtosis. As an illustration we consider the mean dimension of a neural network classifier of digits from the MNIST data set. The classifier is a function of $784$ pixels. For that problem, winding stairs is the best algorithm. We find that inputs to the final softmax layer have mean dimensions ranging from $1.35$ to $2.0$.
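A leave-one-out estimate of the mean dimension can be sketched with a radial design. The sketch uses the identity that the mean dimension is the sum over j of (1/2) E[(f(x) - f(x'))^2], where x' equals x except for a resampled coordinate j, divided by the variance of f. Inputs are assumed uniform on [0, 1]^d, and the function name, test functions and sample size are hypothetical. It illustrates the radial sampler only, not winding stairs or the variance comparison.

```python
import random
import statistics


def mean_dimension_radial(f, d, n, seed=0):
    """Radial estimate of the mean dimension of f on [0, 1]**d."""
    rng = random.Random(seed)
    numerator = 0.0
    baseline_values = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        fx = f(x)
        baseline_values.append(fx)
        for j in range(d):
            changed = list(x)
            changed[j] = rng.random()           # resample only coordinate j
            numerator += 0.5 * (fx - f(changed)) ** 2
    # Sum of leave-one-out variance estimates divided by the variance of f.
    return (numerator / n) / statistics.variance(baseline_values)


additive = mean_dimension_radial(lambda x: sum(x), d=3, n=20000)
product = mean_dimension_radial(lambda x: x[0] * x[1] * x[2], d=3, n=20000)
```

An additive function has mean dimension 1, while the product of three uniform variables has a value near 1.3, reflecting its interaction terms.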

preprint, 2020, arXiv

Optimizing the tie-breaker regression discontinuity design

Motivated by customer loyalty plans and scholarship programs, we study tie-breaker designs which are hybrids of randomized controlled trials (RCTs) and regression discontinuity designs (RDDs). We quantify the statistical efficiency of a tie-breaker design in which a proportion $Δ$ of observed subjects are in the RCT. In a two line regression, statistical efficiency increases monotonically with $Δ$, so efficiency is maximized by an RCT. We point to additional advantages of tie-breakers versus RDD: for a nonparametric regression the boundary bias is much less severe and for quadratic regression, the variance is greatly reduced. For a two line model we can quantify the short term value of the treatment allocation and this comparison favors smaller $Δ$ with the RDD being best. We solve for the optimal tradeoff between these exploration and exploitation goals. The usual tie-breaker design applies an RCT on the middle $Δ$ subjects as ranked by the assignment variable. We quantify the efficiency of other designs such as experimenting only in the second decile from the top. We also show that in some general parametric models a Monte Carlo evaluation can be replaced by matrix algebra.
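The usual tie-breaker assignment rule described above can be sketched as follows. The sketch assumes the randomized band is centered on the median rank and that subjects in the band are treated with probability 1/2. The function name `tie_breaker_assign` and both of those choices are illustrative assumptions.

```python
import random


def tie_breaker_assign(scores, delta, seed=0):
    """Return a 0/1 treatment indicator for each subject.

    Subjects are ranked by the assignment variable. The middle fraction
    `delta` is randomized with probability 1/2 (an assumed choice), subjects
    ranked above the band are treated, and those below it are not.
    """
    rng = random.Random(seed)
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    n_mid = round(delta * n)
    lo = (n - n_mid) // 2          # first rank inside the randomized band
    hi = lo + n_mid                # first rank above the randomized band
    treated = [0] * n
    for rank, i in enumerate(order):
        if rank >= hi:
            treated[i] = 1
        elif rank >= lo:
            treated[i] = rng.randint(0, 1)
    return treated


scores = list(range(100))
rdd = tie_breaker_assign(scores, delta=0.0)    # sharp cutoff: pure RDD
mixed = tie_breaker_assign(scores, delta=0.4)  # middle 40% randomized
```

Setting `delta=1.0` randomizes everyone, which corresponds to the RCT end of the range.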

preprint, 2016, arXiv

Admissibility in Partial Conjunction Testing

Meta-analysis combines results from multiple studies aiming to increase power in finding their common effect. It would typically reject the null hypothesis of no effect if any one of the studies shows strong significance. The partial conjunction null hypothesis is rejected only when at least $r$ of $n$ component hypotheses are non-null with $r = 1$ corresponding to a usual meta-analysis. Compared with meta-analysis, it can encourage replicable findings across studies. A by-product of it when applied to different $r$ values is a confidence interval of $r$ quantifying the proportion of non-null studies. Benjamini and Heller (2008) provided a valid test for the partial conjunction null by ignoring the $r - 1$ smallest p-values and applying a valid meta-analysis p-value to the remaining $n - r + 1$ p-values. We provide sufficient and necessary conditions of admissible combined p-value for the partial conjunction hypothesis among monotone tests. Non-monotone tests always dominate monotone tests but are usually too unreasonable to be used in practice. Based on these findings, we propose a generalized form of Benjamini and Heller's test which allows usage of various types of meta-analysis p-values, and apply our method to an example in assessing replicable benefit of new anticoagulants across subgroups of patients for stroke prevention.
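The Benjamini and Heller construction described above (drop the $r-1$ smallest p-values, then combine the rest) can be sketched with two common combining rules, Bonferroni and Fisher. The function names are hypothetical, independence of the p-values is assumed for the Fisher rule, and the sketch does not implement the generalized test proposed in the paper.

```python
import math


def chi2_sf_even(x, dof):
    """Survival function of a chi-squared variable with an even number `dof`
    of degrees of freedom: exp(-x/2) * sum_{i < dof/2} (x/2)**i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def partial_conjunction_pvalue(pvalues, r, method="bonferroni"):
    """P-value for the null that fewer than r of the hypotheses are non-null."""
    kept = sorted(pvalues)[r - 1:]              # ignore the r - 1 smallest
    if method == "bonferroni":
        return min(1.0, len(kept) * kept[0])
    if method == "fisher":
        statistic = -2.0 * sum(math.log(p) for p in kept)
        return chi2_sf_even(statistic, 2 * len(kept))
    raise ValueError("unknown method")


pvals = [0.001, 0.01, 0.04, 0.5]
at_least_two = partial_conjunction_pvalue(pvals, r=2)
all_four = partial_conjunction_pvalue(pvals, r=4, method="fisher")
```

With $r=1$ nothing is dropped and the rule reduces to the ordinary meta-analysis p-value.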

preprint, 2016, arXiv

Confounder Adjustment in Multiple Hypothesis Testing

We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 and LEAPP, which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true non-nulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.

preprint, 2016, arXiv

Multibrand geographic experiments

In a geographic experiment to measure advertising effectiveness, some regions (hereafter GEOs) get increased advertising while others do not. This paper looks at running $B>1$ such experiments simultaneously on $B$ different brands in $G$ GEOs, and then using shrinkage methods to estimate returns to advertising. There are important practical gains from doing this. Data from any one brand helps to estimate the return of all other brands. We see this in both a frequentist and Bayesian formulation. As a result, each individual experiment could be made smaller and less expensive when they are analyzed together. We also provide an experimental design for multibrand experiments where half of the brands have increased spend in each GEO while half of the GEOs have increased spend for each brand. For $G>B$ the design is a two level factorial for each brand and simultaneously a supersaturated design for the GEOs. Multiple simultaneous experiments also allow one to identify GEOs in which advertising is generally more effective. That cannot be done in the single brand experiments we consider.

preprint, 2015, arXiv

A constraint on extensible quadrature rules

When the worst case integration error in a family of functions decays as $n^{-α}$ for some $α>1$ and simple averages along an extensible sequence match that rate at a set of sample sizes $n_1<n_2<\dots<\infty$, then these sample sizes must grow at least geometrically. More precisely, $n_{k+1}/n_k\ge ρ$ must hold for a value $1<ρ<2$ that increases with $α$. This result always rules out arithmetic sequences but never rules out sample size doubling. The same constraint holds in a root mean square setting.

preprint, 2015, arXiv

Transformations and Hardy-Krause variation

Using a multivariable Faà di Bruno formula we give conditions on transformations $τ:[0,1]^m\to\mathcal{X}$ where $\mathcal{X}$ is a closed and bounded subset of $\mathbb{R}^d$ such that $f\circτ$ is of bounded variation in the sense of Hardy and Krause for all $f\in C^d(\mathcal{X})$. We give similar conditions for $f\circτ$ to be smooth enough for scrambled net sampling to attain $O(n^{-3/2+ε})$ accuracy. Some popular symmetric transformations to the simplex and sphere are shown to satisfy neither condition. Some other transformations due to Fang and Wang (1993) satisfy the first but not the second condition. We provide transformations for the simplex that make $f\circτ$ smooth enough to fully benefit from scrambled net sampling for all $f$ in a class of generalized polynomials. We also find sufficient conditions for the Rosenblatt-Hlawka-Mück transformation in $\mathbb{R}^2$ and for importance sampling to be of bounded variation in the sense of Hardy and Krause.

preprint, 2014, arXiv

Data enriched linear regression

We present a linear regression method for predictions on a small data set making use of a second possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models. The resulting algorithm is a shrinkage method similar to those used in small area estimation. We find a Stein-type finding for Gaussian responses: when the model has 5 or more coefficients and 10 or more error degrees of freedom, it becomes inadmissible to use only the small data set, no matter how large the bias is. We also present both plug-in and AICc-based methods to tune our penalty parameter. Most of our results use an $L_2$ penalty, but we obtain formulas for $L_1$ penalized estimates when the model is specialized to the location setting. Ordinary Stein shrinkage provides an inadmissibility result for only 3 or more coefficients, but we find that our shrinkage method typically produces much lower squared errors in as few as 5 or 10 dimensions when the bias is small and essentially equivalent squared errors when the bias is large.

preprint, 2014, arXiv

Extensible grids: uniform sampling on a space-filling curve

We study the properties of points in $[0,1]^d$ generated by applying Hilbert's space-filling curve to uniformly distributed points in $[0,1]$. For deterministic sampling we obtain a discrepancy of $O(n^{-1/d})$ for $d\ge2$. For random stratified sampling, and scrambled van der Corput points, we get a mean squared error of $O(n^{-1-2/d})$ for integration of Lipschitz continuous integrands, when $d\ge3$. These rates are the same as one gets by sampling on $d$ dimensional grids and they show a deterioration with increasing $d$. The rate for Lipschitz functions is however best possible at that level of smoothness and is better than plain IID sampling. Unlike grids, space-filling curve sampling provides points at any desired sample size, and the van der Corput version is extensible in $n$. Additionally we show that certain discontinuous functions with infinite variation in the sense of Hardy and Krause can be integrated with a mean squared error of $O(n^{-1-1/d})$. It was previously known only that the rate was $o(n^{-1})$. Other space-filling curves, such as those due to Sierpinski and Peano, also attain these rates, while upper bounds for the Lebesgue curve are somewhat worse, as if the dimension were $\log_2(3)$ times as high.
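For $d=2$, mapping one-dimensional points through a Hilbert curve can be sketched with the standard index-to-cell conversion on a finite grid. The sketch is a discretized illustration: it sends each van der Corput point to the center of a grid cell rather than evaluating the continuous curve, and the function names and the grid order are assumptions.

```python
def hilbert_cell(order, index):
    """Map index in [0, 4**order) to (x, y) cell coordinates on a
    2**order by 2**order grid along the Hilbert curve."""
    x = y = 0
    t = index
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate or flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def van_der_corput(i):
    """Base-2 radical inverse of the non-negative integer i."""
    value, scale = 0.0, 0.5
    while i:
        value += scale * (i & 1)
        i >>= 1
        scale *= 0.5
    return value


def hilbert_points(n, order=8):
    """First n van der Corput points pushed through the discretized curve."""
    side = 1 << order
    cells = side * side
    points = []
    for i in range(n):
        x, y = hilbert_cell(order, int(van_der_corput(i) * cells))
        points.append(((x + 0.5) / side, (y + 0.5) / side))  # cell centers
    return points


pts = hilbert_points(64, order=3)
```

Because the van der Corput sequence is extensible, more points can be appended without moving the ones already generated.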

preprint, 2014, arXiv

Low discrepancy constructions in the triangle

Most quasi-Monte Carlo research focuses on sampling from the unit cube. Many problems, especially in computer graphics, are defined via quadrature over the unit triangle. Quasi-Monte Carlo methods for the triangle have been developed by Pillards and Cools (2005) and by Brandolini et al. (2013). This paper presents two QMC constructions in the triangle with a vanishing discrepancy. The first is a version of the van der Corput sequence customized to the unit triangle. It is an extensible digital construction that attains a discrepancy below $12/\sqrt{N}$. The second construction rotates an integer lattice through an angle whose tangent is a quadratic irrational number. It attains a discrepancy of $O(\log(N)/N)$ which is the best possible rate. Previous work strongly indicated that such a discrepancy was possible, but no constructions were available. Scrambling the digits of the first construction improves its accuracy for integration of smooth functions. Both constructions also yield convergent estimates for integrands that are Riemann integrable on the triangle without requiring bounded variation.

preprint, 2014, arXiv

Moment based gene set tests

Motivation: Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high throughput expression analyses. Using $M$ random permutations, one can attain $p$-values as small as $1/(M+1)$. When many gene sets are tested, we need smaller $p$-values, hence larger $M$, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests. Results: We focus on two gene set methods related to sums and sums of squared $t$ statistics. Our approach calculates exact relevant moments of a weighted sum of (squared) test statistics under permutation. We find moment-based gene set enrichment $p$-values that closely approximate the permutation method $p$-values. The computational cost of our algorithm for linear statistics is on the order of doing $|G|$ permutations, where $|G|$ is the number of genes in set $G$. For the quadratic statistics, the cost is on the order of $|G|^2$ permutations which is orders of magnitude faster than naive permutation. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. In the analysis of these experiments with our method, we are able to remove the granularity effects of permutation analyses and have a substantial computational speedup with little cost to accuracy. Availability: Methods available as a Bioconductor package, npGSEA (www.bioconductor.org). Contact: larson.jessica@gene.com

preprint, 2014, arXiv

Optimal mixture weights in multiple importance sampling

In multiple importance sampling we combine samples from a finite list of proposal distributions. When those proposal distributions are used to create control variates, it is possible (Owen and Zhou, 2000) to bound the ratio of the resulting variance to that of the unknown best proposal distribution in our list. The minimax regret arises by taking a uniform mixture of proposals, but that is conservative when there are many components. In this paper we optimize the mixture component sampling rates to gain further efficiency. We show that the sampling variance of mixture importance sampling with control variates is jointly convex in the mixture probabilities and control variate regression coefficients. We also give a sequential importance sampling algorithm to estimate the optimal mixture from the sample data.

preprint, 2014, arXiv

The sign of the logistic regression coefficient

Let Y be a binary random variable and X a scalar. Let $\hatβ$ be the maximum likelihood estimate of the slope in a logistic regression of Y on X with intercept. Further let $\bar x_0$ and $\bar x_1$ be the average of sample x values for cases with y=0 and y=1, respectively. Then under a condition that rules out separable predictors, we show that sign($\hatβ$) = sign($\bar x_1-\bar x_0$). More generally, if $x_i$ are vector valued then we show that $\hatβ=0$ if and only if $\bar x_1=\bar x_0$. This holds for logistic regression and also for more general binary regressions with inverse link functions satisfying a log-concavity condition. Finally, when $\bar x_1\ne \bar x_0$ then the angle between $\hatβ$ and $\bar x_1-\bar x_0$ is less than ninety degrees in binary regressions satisfying the log-concavity condition and the separation condition, when the design matrix has full rank.
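The sign result for scalar X can be checked numerically on a small non-separable data set. The sketch below fits the logistic regression with a plain Newton iteration; the function name, the iteration count and the toy data are assumptions, and no step-size control is included, so it is only suitable for well-behaved inputs like this one.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Newton's method for logistic regression of y on scalar x with intercept.
    Returns (intercept, slope). Assumes the data are not separable."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p                 # score for the intercept
            g1 += (y - p) * x           # score for the slope
            h00 += w                    # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 0, 1, 0, 1]
mean_x1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
mean_x0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
intercept, slope = fit_logistic(xs, ys)
```

Here the mean of x among cases with y=1 exceeds the mean among cases with y=0, and the fitted slope is positive. Swapping the labels reverses both signs.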

preprint, 2013, arXiv

Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data

In high throughput settings we inspect a great many candidate variables (e.g., genes) searching for associations with a primary variable (e.g., a phenotype). High throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. They also change the relative ordering of significance levels among hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable. We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations, it gives better ordering of hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.

preprint, 2012, arXiv

Better estimation of small Sobol' sensitivity indices

A new method for estimating Sobol' indices is proposed. The new method makes use of 3 independent input vectors rather than the usual 2. It attains much greater accuracy on problems where the target Sobol' index is small, even outperforming some oracles which adjust using the true but unknown mean of the function. When the target Sobol' index is quite large, the oracles do better than the new method.
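The contrast between a two-vector and a three-vector estimator can be sketched for the closed index Var(E[f(x) | x_u]) with inputs uniform on [0, 1]^d. The three-vector form used here, the average of (f(x) - f(z_u:x_{-u})) (f(x_u:y_{-u}) - f(y)), is an assumed rendering of the idea and may differ in detail from the estimator in the paper. It is unbiased because every cross term other than E[f(x) f(x_u:y_{-u})] pairs independent evaluations. All names below are hypothetical.

```python
import random


def closed_sobol_estimates(f, d, u, n, seed=0):
    """Estimate Var(E[f(x) | x_u]) for x uniform on [0, 1]**d.

    Returns (two_vector, three_vector) estimates built from the same draws.
    """
    rng = random.Random(seed)
    total2 = total3 = mean_acc = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        z = [rng.random() for _ in range(d)]
        x_u_y = [x[j] if j in u else y[j] for j in range(d)]  # x on u, y elsewhere
        z_u_x = [z[j] if j in u else x[j] for j in range(d)]  # z on u, x elsewhere
        fx = f(x)
        mean_acc += fx
        total2 += fx * f(x_u_y)
        total3 += (fx - f(z_u_x)) * (f(x_u_y) - f(y))
    mu_hat = mean_acc / n
    # The two-vector form subtracts an estimated squared mean; the
    # three-vector form needs no mean estimate.
    return total2 / n - mu_hat ** 2, total3 / n


two, three = closed_sobol_estimates(lambda x: x[0] + 0.1 * x[1], 2, {0}, 20000)
```

For this test function the target value is Var(x_0) = 1/12.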

preprint, 2012, arXiv

Bootstrapping data arrays of arbitrary order

In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the levels of each factor, giving each observation the product of independently sampled factor weights. No exact bootstrap exists for this problem [McCullagh (2000) Bernoulli 6 285-301]. We show that the proposed bootstrap is mildly conservative, meaning biased toward overestimating the variance, under sufficient conditions that allow very unbalanced and heteroscedastic inputs. Earlier results for a resampling bootstrap only apply to two factors and use multinomial weights that are poorly suited to online computation. The proposed reweighting approach can be implemented in parallel and online settings. The results for this method apply to any number of factors. The method is illustrated using a 3 factor data set of comment lengths from Facebook.
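The product reweighting scheme can be sketched for two crossed factors. Several details are assumptions made for illustration: each level receives weight 0 or 2 with equal probability, the reweighted mean is normalized by the total weight, and replicates with zero total weight are skipped. The function name and the toy data are hypothetical.

```python
import random
import statistics


def product_weight_bootstrap(observations, n_boot=500, seed=0):
    """observations: list of (row_level, col_level, value) triples.

    Each replicate draws one weight per row level and one per column level
    and gives every observation the product of its two factor weights.
    Returns the variance of the reweighted means across replicates.
    """
    rng = random.Random(seed)
    rows = sorted({r for r, _, _ in observations})
    cols = sorted({c for _, c, _ in observations})
    means = []
    while len(means) < n_boot:
        row_w = {r: 2 * rng.randint(0, 1) for r in rows}
        col_w = {c: 2 * rng.randint(0, 1) for c in cols}
        weights = [row_w[r] * col_w[c] for r, c, _ in observations]
        total = sum(weights)
        if total == 0:                  # replicate dropped every observation
            continue
        weighted = sum(w * v for w, (_, _, v) in zip(weights, observations))
        means.append(weighted / total)
    return statistics.variance(means)


varied = [(i, j, float(i + j)) for i in range(6) for j in range(5)]
constant = [(i, j, 3.0) for i in range(6) for j in range(5)]
var_varied = product_weight_bootstrap(varied)
var_constant = product_weight_bootstrap(constant)
```

Because weights are drawn per level and not per observation, each replicate only needs the level identifiers, which is what makes the approach convenient for parallel or online use.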

preprint, 2012, arXiv

Variance components and generalized Sobol' indices

This paper introduces generalized Sobol' indices, compares strategies for their estimation, and makes a systematic search for efficient estimators. Of particular interest are contrasts, sums of squares and indices of bilinear form which allow a reduced number of function evaluations compared to alternatives. The bilinear framework includes some efficient estimators from Saltelli (2002) and Mauntz (2002) as well as some new estimators for specific variance components and mean dimensions. This paper also provides a bias corrected version of the estimator of Janon et al. (2012) and extends the bias correction to generalized Sobol' indices. Some numerical comparisons are given.

preprint, 2011, arXiv

Correct ordering in the Zipf-Poisson ensemble

We consider a Zipf-Poisson ensemble in which $X_i\sim\mathrm{Poi}(Ni^{-α})$ for $α>1$ and $N>0$ and integers $i\ge 1$. As $N\to\infty$ the first $n'(N)$ random variables have their proper order $X_1>X_2>\cdots>X_{n'}$ relative to each other, with probability tending to 1 for $n'$ up to $(AN/\log(N))^{1/(α+2)}$ for an explicit constant $A(α)\ge 3/4$. The rate $N^{1/(α+2)}$ cannot be achieved. The ordering of the first $n'(N)$ entities does not preclude $X_m>X_{n'}$ for some interloping $m>n'$. The first $n''$ random variables are correctly ordered exclusive of any interlopers, with probability tending to 1 if $n''\le (BN/\log(N))^{1/(α+2)}$ for $B<A$. For a Zipf-Poisson model of the British National Corpus, which has a total word count of $100{,}000{,}000$, our result estimates that the 72 words with the highest counts are properly ordered.

preprint, 2011, arXiv

Moment based estimation of stochastic Kronecker graph parameters

Stochastic Kronecker graphs supply a parsimonious model for large sparse real world graphs. They can specify the distribution of a large random graph using only three or four parameters. Those parameters have however proved difficult to choose in specific applications. This article looks at method of moments estimators that are computationally much simpler than maximum likelihood. The estimators are fast and in our examples, they typically yield Kronecker parameters with expected feature counts closer to a given graph than we get from KronFit. The improvement was especially prominent for the number of triangles in the graph.
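Moment matching relies on closed-form expected feature counts. As a minimal sketch, the code below checks the expected edge count for a symmetric 2 by 2 initiator [[a, b], [b, c]] raised to the r-th Kronecker power, assuming an undirected graph without self-loops in which each pair i < j is an edge independently with probability P[i][j]. Under that assumption the count is ((a + 2b + c)^r - (a + c)^r) / 2. The function names and parameter values are hypothetical, and the sketch does not implement the estimators themselves.

```python
def kronecker_power(theta, r):
    """r-fold Kronecker power of a square matrix given as nested lists."""
    result = [[1.0]]
    for _ in range(r):
        result = [
            [p * t for p in p_row for t in t_row]
            for p_row in result
            for t_row in theta
        ]
    return result


def expected_edges_bruteforce(theta, r):
    """Sum of edge probabilities over all pairs i < j."""
    prob = kronecker_power(theta, r)
    n = len(prob)
    return sum(prob[i][j] for i in range(n) for j in range(i + 1, n))


def expected_edges_formula(a, b, c, r):
    """(sum of all entries minus the trace) / 2, using Kronecker structure."""
    return ((a + 2 * b + c) ** r - (a + c) ** r) / 2.0


a, b, c, r = 0.9, 0.5, 0.2, 4
brute = expected_edges_bruteforce([[a, b], [b, c]], r)
closed_form = expected_edges_formula(a, b, c, r)
```

The closed form costs a handful of arithmetic operations regardless of graph size, which is the kind of saving that makes moment-based fitting cheap compared with likelihood methods.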

preprint, 2011, arXiv

Outlier Detection Using Nonconvex Penalized Regression

This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the $n$ data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual $L_1$ penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The $L_1$ penalty corresponds to soft thresholding. We introduce a thresholding (denoted by $Θ$) based iterative procedure for outlier detection ($Θ$-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that $Θ$-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most $O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate. We describe the connection between $Θ$-IPOD and $M$-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned $Θ$-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with $p\gg n$, if both the coefficient vector and the outlier pattern are sparse.
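The alternating structure of a thresholding-based procedure can be sketched in the simplest setting, an intercept-only model, where the least squares step is just a mean. This simplification, the function names, the fixed iteration count and the threshold value are all assumptions. The general method described above regresses on $p$ predictors and tunes the threshold from the data.

```python
def hard_threshold(value, lam):
    """Keep values whose magnitude exceeds lam; set the rest to zero."""
    return value if abs(value) > lam else 0.0


def ipod_location(y, lam, iterations=50):
    """Alternate between fitting the mean of the shift-adjusted responses
    and thresholding the residuals to update the mean shift parameters."""
    n = len(y)
    gamma = [0.0] * n                   # one mean shift parameter per point
    beta = 0.0
    for _ in range(iterations):
        beta = sum(yi - gi for yi, gi in zip(y, gamma)) / n
        gamma = [hard_threshold(yi - beta, lam) for yi in y]
    return beta, gamma


y = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]
beta, gamma = ipod_location(y, lam=3.0)
outliers = [i for i, g in enumerate(gamma) if g != 0.0]
```

On this toy data the last observation is flagged and the location estimate settles at the mean of the remaining five points.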

preprint, 2010, arXiv

Empirical stationary correlations for semi-supervised learning on graphs

In semi-supervised learning on graphs, response variables observed at one node are used to estimate missing values at other nodes. The methods exploit correlations between nearby nodes in the graph. In this paper we prove that many such proposals are equivalent to kriging predictors based on a fixed covariance matrix driven by the link structure of the graph. We then propose a data-driven estimator of the correlation structure that exploits patterns among the observed response values. By incorporating even a small fraction of observed covariation into the predictions, we are able to obtain much improved prediction on two graph data sets.
