Source author record

Xiao-Li Meng

Xiao-Li Meng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, math.ST, Statistics Theory, astro-ph.IM, Computation, Applications, astro-ph.CO, Cryptography and Security, Machine Learning, math.PR, physics.data-an

Catalog footprint

What is connected

13 works

11 topics

4 close collaborators

preprint, 2026, arXiv

The LIRA-Ising Model: Estimating the boundaries of irregularly shaped X-ray sources

Mapping the boundary of an extended source is a key step in the study of its morphology. The background contamination and statistical fluctuations of typical astronomical images make this a challenging statistical task, particularly for X-ray images with low surface brightness. We develop a three-step Bayesian procedure to identify the boundaries of irregularly shaped sources. We first apply a Bayesian multiscale reconstruction algorithm known as LIRA to obtain posterior pixelwise probability distributions of the source intensity that properly account for known structures, astrophysical background, and the effect of the telescope point spread function. Next, we adopt an Ising model to group pixels with similar intensities into cohesive regions corresponding to background and source. Finally, the boundary is derived on the basis of the most likely aggregation of pixels into the source region. Because the overall model combines LIRA and the Ising model, we call it LIRA-Ising. We verify the proposed method using a set of simulation studies. We then apply it to the Chandra X-ray Observatory images of two high-redshift quasars, PKS J1421-0643 and 0730+257, to determine the extent and morphology of their X-ray jets. Our method shows a uniform X-ray surface brightness in the jet of PKS J1421-0643, and identifies knotty structure in the X-ray jet of 0730+257.

preprint, 2024, arXiv

Perfecting MCMC Sampling: Recipes and Reservations

This review paper is intended for the Handbook of Markov chain Monte Carlo's second edition. The authors will be grateful for any suggestions that could perfect it.

preprint, 2022, arXiv

Scalable Spike-and-Slab

Spike-and-slab priors are commonly used for Bayesian variable selection, due to their interpretability and favorable statistical properties. However, existing samplers for spike-and-slab posteriors incur prohibitive computational costs when the number of variables is large. In this article, we propose Scalable Spike-and-Slab ($S^3$), a scalable Gibbs sampling implementation for high-dimensional Bayesian regression with the continuous spike-and-slab prior of George and McCulloch (1993). For a dataset with $n$ observations and $p$ covariates, $S^3$ has order $\max\{ n^2 p_t, np \}$ computational cost at iteration $t$ where $p_t$ never exceeds the number of covariates switching spike-and-slab states between iterations $t$ and $t-1$ of the Markov chain. This improves upon the order $n^2 p$ per-iteration cost of state-of-the-art implementations as, typically, $p_t$ is substantially smaller than $p$. We apply $S^3$ on synthetic and real-world datasets, demonstrating orders of magnitude speed-ups over existing exact samplers and significant gains in inferential quality over approximate samplers with comparable cost.
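The per-iteration cost comparison in this abstract can be made concrete with a little arithmetic. The sketch below only evaluates the two stated orders, $\max\{n^2 p_t, np\}$ and $n^2 p$, ignoring constants; the function names and the values of `n`, `p` and `p_t` are hypothetical and chosen for illustration, not taken from the paper.

```python
def s3_cost(n, p, p_t):
    """Order of the per-iteration cost stated for S^3: max{n^2 * p_t, n * p}."""
    return max(n * n * p_t, n * p)


def baseline_cost(n, p):
    """Order of the per-iteration cost stated for existing implementations: n^2 * p."""
    return n * n * p


# Hypothetical problem size: few observations, many covariates,
# and only a handful of covariates switching state per iteration.
n, p, p_t = 100, 10_000, 5

speedup = baseline_cost(n, p) / s3_cost(n, p, p_t)
print(f"S^3 order:      {s3_cost(n, p, p_t):,}")
print(f"baseline order: {baseline_cost(n, p):,}")
print(f"ratio:          {speedup:.0f}x")
```

With these illustrative numbers the $np$ term dominates the maximum, so the ratio of the two orders equals $n$. When $p_t$ is large enough that $n^2 p_t$ dominates, the ratio becomes $p / p_t$ instead.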

preprint, 2021, arXiv

Multiple Improvements of Multiple Imputation Likelihood Ratio Tests

Multiple imputation (MI) inference handles missing data by imputing the missing values $m$ times, and then combining the results from the $m$ complete-data analyses. However, the existing method for combining likelihood ratio tests (LRTs) has multiple defects: (i) the combined test statistic can be negative, but its null distribution is approximated by an $F$-distribution; (ii) it is not invariant to re-parametrization; (iii) it fails to ensure monotonic power owing to its use of an inconsistent estimator of the fraction of missing information (FMI) under the alternative hypothesis; and (iv) it requires nontrivial access to the LRT statistic as a function of parameters instead of data sets. We show, using both theoretical derivations and empirical investigations, that essentially all of these problems can be straightforwardly addressed if we are willing to perform an additional LRT by stacking the $m$ completed data sets as one big completed data set. This enables users to implement the MI LRT without modifying the complete-data procedure. A particularly intriguing finding is that the FMI can be estimated consistently by an LRT statistic for testing whether the $m$ completed data sets can be regarded effectively as samples coming from a common model. Practical guidelines are provided based on an extensive comparison of existing MI tests. Issues related to nuisance parameters are also discussed.

preprint, 2021, arXiv

Prior sample size extensions for assessing prior impact and prior--likelihood discordance

This paper outlines a framework for quantifying the prior's contribution to posterior inference in the presence of prior-likelihood discordance, a broader concept than the usual notion of prior-likelihood conflict. We achieve this dual purpose by extending the classic notion of \textit{prior sample size}, $M$, in three directions: (I) estimating $M$ beyond conjugate families; (II) formulating $M$ as a relative notion, i.e., as a function of the likelihood sample size $k, M(k),$ which also leads naturally to a graphical diagnosis; and (III) permitting negative $M$, as a measure of prior-likelihood conflict, i.e., harmful discordance. Our asymptotic regime permits the prior sample size to grow with the likelihood data size, hence making asymptotic arguments meaningful for investigating the impact of the prior relative to that of likelihood. It leads to a simple asymptotic formula for quantifying the impact of a proper prior that only involves computing a centrality and a spread measure of the prior and the posterior. We use simulated and real data to illustrate the potential of the proposed framework, including quantifying how weak is a "weakly informative" prior adopted in a study of lupus nephritis. Whereas we take a pragmatic perspective in assessing the impact of a prior on a given inference problem under a specific evaluative metric, we also touch upon conceptual and theoretical issues such as using improper priors and permitting priors with asymptotically non-vanishing influence.
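For readers unfamiliar with the classic notion that this abstract extends, a minimal sketch of the conjugate case may help. It uses the textbook Beta-Binomial model, where a Beta($a$, $b$) prior behaves like $M = a + b$ pseudo-observations and the posterior mean is a weighted average of the prior mean and the sample mean with weights $M$ and $k$. It does not implement any of the paper's three extensions; the function name and the numbers are illustrative assumptions.

```python
def beta_binomial_summary(a, b, successes, k):
    """Classic prior sample size in the conjugate Beta-Binomial model.

    Prior: Beta(a, b). Data: `successes` out of `k` Bernoulli trials.
    """
    M = a + b                          # classic prior sample size
    prior_mean = a / M
    sample_mean = successes / k
    # Posterior is Beta(a + successes, b + k - successes).
    posterior_mean = (a + successes) / (M + k)
    # The same quantity written as a weighted average with weights M and k.
    weighted_mean = (M * prior_mean + k * sample_mean) / (M + k)
    return M, posterior_mean, weighted_mean


# Illustrative values: a Beta(2, 8) prior and 30 successes in 50 trials.
M, posterior_mean, weighted_mean = beta_binomial_summary(2, 8, 30, 50)
print(M, posterior_mean, weighted_mean)
```

The two ways of writing the posterior mean agree, which is what makes $M$ interpretable as a sample size in the conjugate setting.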

preprint, 2020, arXiv

Congenial Differential Privacy under Mandated Disclosure

Differentially private data releases are often required to satisfy a set of external constraints that reflect the legal, ethical, and logical mandates to which the data curator is obligated. The enforcement of constraints, when treated as post-processing, adds an extra phase in the production of privatized data. It is well understood in the theory of multi-phase processing that congeniality, a form of procedural compatibility between phases, is a prerequisite for the end users to straightforwardly obtain statistically valid results. Congenial differential privacy is theoretically principled, which facilitates transparency and intelligibility of the mechanism that would otherwise be undermined by ad-hoc post-processing procedures. We advocate for the systematic integration of mandated disclosure into the design of the privacy mechanism via standard probabilistic conditioning on the invariant margins. Conditioning automatically renders congeniality because any extra post-processing phase becomes unnecessary. We provide both initial theoretical guarantees and a Markov chain algorithm for our proposal. We also discuss intriguing theoretical issues that arise in comparing congenial differential privacy and optimization-based post-processing, as well as directions for further research.

preprint, 2015, arXiv

An unexpected encounter with Cauchy and Lévy

The Cauchy distribution is usually presented as a mathematical curiosity, an exception to the Law of Large Numbers, or even as an "Evil" distribution in some introductory courses. It therefore surprised us when Drton and Xiao (2016) proved the following result for $m=2$ and conjectured it for $m\ge 3$. Let $X= (X_1,\ldots, X_m)$ and $Y = (Y_1, \ldots,Y_m)$ be i.i.d. $N(0,\Sigma)$, where $\Sigma=\{\sigma_{ij}\}\ge 0$ is an \textit{arbitrary} $m\times m$ covariance matrix with $\sigma_{jj}>0$ for all $1\leq j\leq m$. Then $$Z = \sum_{j=1}^m w_j \frac{X_j}{Y_j} \sim \mathrm{Cauchy}(0,1),$$ as long as $w=(w_1,\ldots, w_m)$ is independent of $(X, Y)$, $w_j\ge 0$ for $j=1,\ldots, m$, and $\sum_{j=1}^m w_j=1$. In this note, we present an elementary proof of this conjecture for any $m \geq 2$ by linking $Z$ to a geometric characterization of Cauchy(0,1) given in Williams (1969). This general result is essential to the large sample behavior of Wald tests in many applications such as factor models and contingency tables. It also leads to other unexpected results such as $$ \sum_{i=1}^m\sum_{j=1}^m \frac{w_iw_j\sigma_{ij}}{X_iX_j} \sim \text{Lévy}(0, 1). $$ This generalizes the "super Cauchy phenomenon" that the average of $m$ i.i.d. standard Lévy variables (i.e., inverse chi-squared variables with one degree of freedom) has the same distribution as that of a single standard Lévy variable multiplied by $m$ (which is obtained by taking $w_j=1/m$ and $\Sigma$ to be the identity matrix).
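The first displayed result can be checked numerically with a small Monte Carlo experiment. The sketch below draws $X$ and $Y$ independently from $N(0,\Sigma)$ for a correlated $3\times 3$ covariance matrix, forms $Z=\sum_j w_j X_j/Y_j$ with fixed weights, and compares the sample quartiles of $Z$ with those of a standard Cauchy, which are $-1$, $0$ and $1$. The particular covariance matrix, weights, seed and sample size are arbitrary choices made for this illustration, and the helper names are hypothetical; only the Python standard library is used.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def draw_normal(L, rng):
    """One draw from N(0, L L^T) using independent standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in L]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]


def simulate_z(sigma, w, n_draws, seed=0):
    """Draws of Z = sum_j w_j X_j / Y_j with X, Y i.i.d. N(0, sigma)."""
    rng = random.Random(seed)
    L = cholesky(sigma)
    draws = []
    for _ in range(n_draws):
        x = draw_normal(L, rng)
        y = draw_normal(L, rng)
        draws.append(sum(wj * xj / yj for wj, xj, yj in zip(w, x, y)))
    return draws


def quantile(sorted_vals, q):
    """Simple empirical quantile of an already sorted list."""
    return sorted_vals[int(q * (len(sorted_vals) - 1))]


# Illustrative correlated covariance matrix and non-uniform weights summing to 1.
sigma = [[1.0, 0.8, -0.3],
         [0.8, 2.0, 0.5],
         [-0.3, 0.5, 1.0]]
w = [0.2, 0.5, 0.3]

z = sorted(simulate_z(sigma, w, n_draws=100_000))
q25, q50, q75 = (quantile(z, q) for q in (0.25, 0.5, 0.75))
print(f"sample quartiles: {q25:.3f}, {q50:.3f}, {q75:.3f} (Cauchy(0,1): -1, 0, 1)")
```

Quartiles are used rather than the sample mean or variance because a Cauchy variable has neither, so moment-based checks would not stabilize as the number of draws grows.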

preprint, 2015, arXiv

There is Individualized Treatment. Why Not Individualized Inference?

Doctors use statistics to advance medical knowledge; we use a medical analogy to introduce statistical inference "from scratch" and to highlight an improvement. Your doctor, perhaps implicitly, predicts the effectiveness of a treatment for you based on its performance in a clinical trial; the trial patients serve as controls for you. The same logic underpins statistical inference: to identify the best statistical procedure to use for a problem, we simulate a set of control problems and evaluate candidate procedures on the controls. Now for the improvement: recent interest in personalized/individualized medicine stems from the recognition that some clinical trial patients are better controls for you than others. Therefore, treatment decisions for you should depend only on a subset of relevant patients. Individualized statistical inference implements this idea for control problems (rather than patients). Its potential for improving data analysis matches personalized medicine's for improving healthcare. The central issue--for both individualized medicine and individualized inference--is how to make the right relevance-robustness trade-off: if we exercise too much judgement in determining which controls are relevant, our inferences will not be robust. How much is too much? We argue that the unknown answer is the Holy Grail of statistical inference.

preprint, 2014, arXiv

Strong Lens Time Delay Challenge: II. Results of TDC1

We present the results of the first strong lens time delay challenge. The motivation, experimental design, and entry level challenge are described in a companion paper. This paper presents the main challenge, TDC1, which consisted of analyzing thousands of simulated light curves blindly. The observational properties of the light curves cover the range in quality obtained for current targeted efforts (e.g., COSMOGRAIL) and expected from future synoptic surveys (e.g., LSST), and include simulated systematic errors. \nteamsA\ teams participated in TDC1, submitting results from \nmethods\ different method variants. After describing each method, we compute and analyze basic statistics measuring accuracy (or bias) $A$, goodness of fit $\chi^2$, precision $P$, and success rate $f$. For some methods we identify outliers as an important issue. Other methods show that outliers can be controlled via visual inspection or conservative quality control. Several methods are competitive, i.e., give $|A|<0.03$, $P<0.03$, and $\chi^2<1.5$, with some of the methods already reaching sub-percent accuracy. The fraction of light curves yielding a time delay measurement is typically in the range $f = 20$--$40\%$. It depends strongly on the quality of the data: COSMOGRAIL-quality cadence and light curve lengths yield significantly higher $f$ than does sparser sampling. Taking the results of TDC1 at face value, we estimate that LSST should provide around 400 robust time-delay measurements, each with $P<0.03$ and $|A|<0.01$, comparable to current lens modeling uncertainties. In terms of observing strategies, we find that $A$ and $f$ depend mostly on season length, while $P$ depends mostly on cadence and campaign duration.
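The abstract names four summary statistics but does not define them. The sketch below shows one plausible way to compute them, assuming the following forms, which are an assumption of this example rather than something stated above: $f$ is the fraction of light curves with a submitted delay, and $A$, $P$ and $\chi^2$ are averages over the submitted curves of the fractional error, the fractional quoted uncertainty, and the squared normalized residual. The function name and the toy numbers are hypothetical; only the thresholds $|A|<0.03$, $P<0.03$ and $\chi^2<1.5$ come from the abstract.

```python
def tdc_style_metrics(true_delays, est_delays, est_errors, n_total):
    """Summary statistics under the assumed definitions described in the lead-in.

    true_delays, est_delays, est_errors: values for the submitted curves only.
    n_total: number of light curves in the full challenge sample.
    """
    n_sub = len(est_delays)
    rows = list(zip(true_delays, est_delays, est_errors))
    f = n_sub / n_total                                           # success rate
    A = sum((e - t) / t for t, e, _ in rows) / n_sub              # accuracy (bias)
    P = sum(d / abs(t) for t, _, d in rows) / n_sub               # precision
    chi2 = sum(((e - t) / d) ** 2 for t, e, d in rows) / n_sub    # goodness of fit
    return f, A, P, chi2


def is_competitive(A, P, chi2):
    """Thresholds quoted in the abstract for a competitive method."""
    return abs(A) < 0.03 and P < 0.03 and chi2 < 1.5


# Toy numbers for illustration only: 3 submitted delays (in days) out of 10 curves.
true_delays = [10.0, -20.0, 40.0]
est_delays = [10.5, -19.0, 40.0]
est_errors = [0.5, 1.0, 2.0]

f, A, P, chi2 = tdc_style_metrics(true_delays, est_delays, est_errors, n_total=10)
print(f, A, P, chi2, is_competitive(A, P, chi2))
```

In this toy case the bias averages out and the fit is acceptable, but the quoted uncertainties are 5% of the delays, so the precision threshold is not met.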

preprint, 2013, arXiv

The potential and perils of preprocessing: Building new foundations

Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. However, it is rife with subtleties and pitfalls. Decisions made in preprocessing constrain all later analyses and are typically irreversible. Hence, data analysis becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Even if each party has done its best given the information and resources available to them, the final result may still fall short of the best possible in the traditional single-phase inference framework. This is particularly relevant as we enter the era of "big data". The technologies driving this data explosion are subject to complex new forms of measurement error. Simultaneously, we are accumulating increasingly massive databases of scientific analyses. As a result, preprocessing has become more vital (and potentially more dangerous) than ever before.

preprint, 2011, arXiv

Accounting for Calibration Uncertainties in X-ray Analysis: Effective Areas in Spectral Fitting

While considerable advances have been made to account for statistical uncertainties in astronomical analyses, systematic instrumental uncertainties have been generally ignored. This can be crucial to a proper interpretation of analysis results because instrumental calibration uncertainty is a form of systematic uncertainty. Ignoring it can underestimate error bars and introduce bias into the fitted values of model parameters. Accounting for such uncertainties currently requires extensive case-specific simulations if using existing analysis packages. Here we present general statistical methods that incorporate calibration uncertainties into spectral analysis of high-energy data. We first present a method based on multiple imputation that can be applied with any fitting method, but is necessarily approximate. We then describe a more exact Bayesian approach that works in conjunction with Markov chain Monte Carlo based fitting. We explore methods for improving computational efficiency, and in particular detail a method of summarizing calibration uncertainties with a principal component analysis of samples of plausible calibration files. This method is implemented using recently codified Chandra effective area uncertainties for low-resolution spectral analysis and is verified using both simulated and actual Chandra data. Our procedure for incorporating effective area uncertainty is easily generalized to other types of calibration uncertainties.

preprint, 2011, arXiv

Cross-Fertilizing Strategies for Better EM Mountain Climbing and DA Field Exploration: A Graphical Guide Book

In recent years, a variety of extensions and refinements have been developed for data augmentation based model fitting routines. These developments aim to extend the application, improve the speed and/or simplify the implementation of data augmentation methods, such as the deterministic EM algorithm for mode finding and the stochastic Gibbs sampler and other auxiliary-variable based methods for posterior sampling. In this overview article we graphically illustrate and compare a number of these extensions, all of which aim to maintain the simplicity and computational stability of their predecessors. We particularly emphasize the usefulness of identifying similarities between the deterministic and stochastic counterparts as we seek more efficient computational strategies. We also demonstrate the applicability of data augmentation methods for handling complex models with highly hierarchical structure, using a high-energy high-resolution spectral imaging model for data from satellite telescopes, such as the Chandra X-ray Observatory.

preprint, 2010, arXiv

Decoding the H-likelihood

Discussion of "Likelihood Inference for Models with Unobservables: Another View" by Youngjo Lee and John A. Nelder [arXiv:1010.0303]
