Researcher profile

Yukun Liu

Yukun Liu contributes to research discovery and scholarly infrastructure.

Researcher

Affiliation not imported

Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging

Verification L1

Unclaimed author

7 works

0 followers

6 topics

4 close collaborators

Published work

7 published items

Preprint, 2026, arXiv

Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity

Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.
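The abstract states that the W-estimator is numerically equivalent to the classical 1-nearest neighbor estimator. As a rough illustration of that classical estimator only (not the authors' code), the sketch below estimates a target-population mean by giving each target covariate the response of its nearest source covariate. The function name `one_nn_target_mean`, the Euclidean metric, and the tuple-based data layout are assumptions made for this example.

```python
import math


def one_nn_target_mean(x_source, y_source, x_target):
    """Estimate the target mean of Y by 1-nearest-neighbor imputation.

    x_source: list of covariate tuples observed with a response (source sample)
    y_source: list of responses, aligned with x_source
    x_target: list of covariate tuples from the target population (no responses)
    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    if not x_source or len(x_source) != len(y_source):
        raise ValueError("x_source and y_source must be non-empty and aligned")
    if not x_target:
        raise ValueError("x_target must be non-empty")

    total = 0.0
    for xt in x_target:
        # Index of the source point closest to this target point
        # (ties are broken in favor of the lowest index).
        nearest = min(range(len(x_source)),
                      key=lambda i: math.dist(x_source[i], xt))
        total += y_source[nearest]
    # Average the imputed responses over the target sample.
    return total / len(x_target)


if __name__ == "__main__":
    xs = [(0.0,), (1.0,), (2.0,)]
    ys = [10.0, 20.0, 30.0]
    xt = [(0.1,), (1.9,), (2.2,)]
    print(one_nn_target_mean(xs, ys, xt))
```

Note that no outcome regression or importance weight is fitted here, which matches the abstract's description of the framework; the asymptotic properties claimed in the abstract are not something this sketch demonstrates.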

Preprint, 2022, arXiv

Nearly optimal capture-recapture sampling and empirical likelihood weighting estimation for M-estimation with big data

Subsampling techniques can reduce the computational costs of processing big data. Practical subsampling plans typically involve initial uniform sampling and refined sampling. With a subsample, big data inferences are generally built on the inverse probability weighting (IPW), which becomes unstable when the probability weights are close to zero and cannot incorporate auxiliary information. First, we consider capture-recapture sampling, which combines an initial uniform sampling with a second Poisson sampling. Under this sampling plan, we propose an empirical likelihood weighting (ELW) estimation approach to an M-estimation parameter. Second, based on the ELW method, we construct a nearly optimal capture-recapture sampling plan that balances estimation efficiency and computational costs. Third, we derive methods for determining the smallest sample sizes with which the proposed sampling-and-estimation method produces estimators of guaranteed precision. Our ELW method overcomes the instability of IPW by circumventing the use of inverse probabilities, and utilizes auxiliary information including the size and certain sample moments of big data. We show that the proposed ELW method produces more efficient estimators than IPW, leading to more efficient optimal sampling plans and more economical sample sizes for a prespecified estimation precision. These advantages are confirmed through simulation studies and real data analyses.

Preprint, 2022, arXiv

Tuning-parameter-free optimal propensity score matching approach for causal inference

Propensity score matching (PSM) is a pseudo-experimental method that uses statistical techniques to construct an artificial control group by matching each treated unit with one or more untreated units of similar characteristics. To date, the problem of determining the optimal number of matches per unit, which plays an important role in PSM, has not been adequately addressed. We propose a tuning-parameter-free PSM method based on the nonparametric maximum-likelihood estimation of the propensity score under the monotonicity constraint. The estimated propensity score is piecewise constant, and therefore automatically groups data. Hence, our proposal is free of tuning parameters. The proposed estimator is asymptotically semiparametric efficient for the univariate case, and achieves this level of efficiency in the multivariate case when the outcome and the propensity score depend on the covariate in the same direction. We conclude that matching methods based on the propensity score alone cannot, in general, be efficient.
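The abstract notes that the monotone-constrained maximum-likelihood estimate of the propensity score is piecewise constant and therefore groups the data automatically. For a single covariate and a binary treatment, a classical result is that this constrained estimate coincides with the isotonic regression of the treatment indicators, which the pool-adjacent-violators algorithm computes. The sketch below illustrates that univariate case only; it assumes distinct covariate values and a propensity that is non-decreasing in the covariate, and the function names are hypothetical rather than taken from the paper.

```python
def pava_increasing(values):
    """Least-squares non-decreasing fit to a sequence (pool-adjacent-violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool the last two blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted


def monotone_propensity(x, treated):
    """Piecewise-constant propensity estimate, non-decreasing in x.

    x: list of scalar covariates (assumed distinct in this sketch)
    treated: list of 0/1 treatment indicators, aligned with x
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted_sorted = pava_increasing([treated[i] for i in order])
    scores = [0.0] * len(x)
    for rank, i in enumerate(order):
        scores[i] = fitted_sorted[rank]  # map back to the original ordering
    return scores


if __name__ == "__main__":
    x = [3.0, 1.0, 2.0, 5.0, 4.0]
    treated = [0, 0, 1, 1, 1]
    print(monotone_propensity(x, treated))
```

Units that share a fitted value form one group, so the number of matches per unit follows from the fit rather than from a user-chosen tuning parameter. How the paper then builds its matching estimator from these groups is not shown here.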

Preprint, 2022, arXiv

Weighted-average quantile regression

In this paper, we introduce the weighted-average quantile regression framework, $\int_0^1 q_{Y|X}(u)ψ(u)du = X'β$, where $Y$ is a dependent variable, $X$ is a vector of covariates, $q_{Y|X}$ is the quantile function of the conditional distribution of $Y$ given $X$, $ψ$ is a weighting function, and $β$ is a vector of parameters. We argue that this framework is of interest in many applied settings and develop an estimator of the vector of parameters $β$. We show that our estimator is $\sqrt T$-consistent and asymptotically normal with mean zero and an easily estimable covariance matrix, where $T$ is the size of the available sample. We demonstrate the usefulness of our estimator by applying it in two empirical settings. In the first setting, we focus on financial data and study the factor structures of the expected shortfalls of the industry portfolios. In the second setting, we focus on wage data and study how inequality and social welfare depend on commonly used individual characteristics.
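Two special cases may help in reading the weighted-average framework. They are illustrative choices of the weighting function and are not necessarily the ones used in the paper; the level $\alpha$ is a symbol introduced here, and sign conventions for expected shortfall vary across the literature.

```latex
% Uniform weight: the integral of the conditional quantile function is the
% conditional mean, so the model reduces to linear mean regression.
\psi(u) \equiv 1:
\qquad \int_0^1 q_{Y|X}(u)\,du = \mathbb{E}[Y \mid X] = X'\beta .

% Lower-tail weight at level \alpha \in (0,1): the left-hand side is the
% average of the conditional quantiles below \alpha, i.e. a lower-tail
% expected shortfall of Y given X.
\psi(u) = \frac{\mathbf{1}\{u \le \alpha\}}{\alpha}:
\qquad \frac{1}{\alpha}\int_0^{\alpha} q_{Y|X}(u)\,du = X'\beta .
```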

Preprint, 2020, arXiv

A selective review on calibration information from similar studies based on parametric likelihood or empirical likelihood

In multi-center clinical trials, public access to individual-level data is often strictly restricted for various reasons. Instead, summarized information is widely available from published results. With advances in computational technology, it has become very common for data analyses to run on hundreds or thousands of machines simultaneously, with the data distributed across those machines and no longer available in a single central location. How to effectively assemble summarized clinical information, or information from each machine in parallel computation, has become a challenging task for statisticians and computer scientists. In this paper, we selectively review some recently developed statistical methods, including communication-efficient distributed statistical inference and renewal estimation and incremental inference, which can be regarded as the latest developments of calibration information methods in the era of big data. Even though those methods were developed in different fields and in different statistical frameworks, in principle they are asymptotically equivalent to well-known methods developed in meta-analysis. Little or no information is lost compared with the case when the full data are available. As a general tool for integrating information, we also review the generalized method of moments and the estimating equations approach based on the empirical likelihood method.

Preprint, 2020, arXiv

Permutation tests under a rotating sampling plan with clustered data

Consider a population consisting of clusters of sampling units, evolving temporally, spatially, or according to other dynamics. We wish to monitor the evolution of its means, medians, or other parameters. For administrative convenience and informativeness, clustered data are often collected via a rotating plan. Under rotating plans, the observations in the same clusters are correlated, and observations on the same unit collected on different occasions are also correlated. Ignoring this correlation structure may lead to invalid inference procedures. Accommodating cluster structure in parametric models is difficult or carries a high risk of misspecification. In this paper, we exploit exchangeability in clustered data collected via a rotating sampling plan to develop a permutation scheme for testing various hypotheses of interest. We also introduce a semiparametric density ratio model to accommodate the multiple-population structure in rotating sampling plans. The combination ensures the validity of the inference methods while extracting maximum information from the sampling plan. A simulation study indicates that the proposed tests firmly control the type I error whether or not the data are clustered. The use of the density ratio model improves the power of the tests.

Preprint, 2010, arXiv

Adjusted empirical likelihood with high-order precision

Empirical likelihood is a popular nonparametric or semi-parametric statistical method with many nice statistical properties. Yet when the sample size is small, or the dimension of the accompanying estimating function is high, the application of the empirical likelihood method can be hindered by low precision of the chi-square approximation and by nonexistence of solutions to the estimating equations. In this paper, we show that the adjusted empirical likelihood is effective at addressing both problems. With a specific level of adjustment, the adjusted empirical likelihood achieves the high-order precision of the Bartlett correction, in addition to the advantage of a guaranteed solution to the estimating equations. Simulation results indicate that the confidence regions constructed by the adjusted empirical likelihood have coverage probabilities comparable to or substantially more accurate than the original empirical likelihood enhanced by the Bartlett correction.
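To make the "guaranteed solution" point concrete, the sketch below computes an adjusted empirical likelihood ratio statistic for a scalar mean. It assumes the usual construction in which a single pseudo-observation $-a_n \bar g$ is appended to the estimating-function values $g_i = x_i - \mu$, so that zero always lies inside their range. The adjustment level `a_n` is left as an input because the abstract does not state the specific level that yields Bartlett-type precision; the function name and the bisection solver are choices made for this example, not the authors' implementation.

```python
import math


def adjusted_el_statistic(sample, mu, a_n):
    """Adjusted empirical log-likelihood ratio statistic for a scalar mean.

    sample: list of floats; mu: hypothesized mean; a_n: positive adjustment level
    Returns 2 * sum(log(1 + lam * g_i)) over the augmented values g_i.
    """
    if len(sample) < 2 or a_n <= 0:
        raise ValueError("need at least two observations and a_n > 0")
    g = [x - mu for x in sample]
    g_bar = sum(g) / len(g)
    if g_bar == 0.0:
        return 0.0  # lam = 0 solves the equation exactly
    g.append(-a_n * g_bar)  # pseudo-observation with the opposite sign of g_bar

    # The multiplier lam must keep 1 + lam * g_i > 0 for every i.
    lo, hi = -1.0 / max(g), -1.0 / min(g)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break  # interval has collapsed to floating-point resolution
        score = sum(gi / (1.0 + mid * gi) for gi in g)
        # The score is strictly decreasing in lam, so bisect on its sign.
        if score > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * gi) for gi in g)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    # mu = 10 lies outside the range of the data, where the unadjusted
    # empirical likelihood has no solution; the adjusted version stays finite.
    print(adjusted_el_statistic(data, mu=10.0, a_n=1.0))
```

The statistic would typically be compared with a chi-square quantile to form a confidence region; the calibration and the high-order accuracy discussed in the abstract depend on the choice of `a_n`, which this sketch does not address.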