Source author record

Yijun Zuo

Yijun Zuo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.ST, Statistics Theory, Computation, Methodology, Machine Learning

Catalog footprint

What is connected

12 works

5 topics

4 close collaborators


preprint 2022 arXiv

Asymptotic normality of the least sum of squares of trimmed residuals estimator

To enhance the robustness of the classic least sum of squares (LS) of residuals estimator, Zuo (2022) introduced the least sum of squares of trimmed (LST) residuals estimator. The LST enjoys many desirable properties and serves well as a robust alternative to the LS. Its asymptotic properties, including strong and root-n consistency, have been established, whereas its asymptotic normality has been left unaddressed. This article resolves this remaining problem.
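As a rough illustration of what a trimmed least squares objective looks like, the sketch below evaluates a sum of squared residuals over only those points whose residual is not flagged as outlying. The abstract does not state the trimming rule, so the median/MAD outlyingness rule, the cutoff `alpha`, and the function name `lst_objective` are assumptions made for this example.

```python
import numpy as np

def lst_objective(beta, X, y, alpha=3.0):
    """Sum of squared residuals over points kept by an outlyingness rule.

    Assumed rule (not taken from the abstract): keep residual r_i when
    |r_i - median(r)| <= alpha * MAD(r).
    """
    r = y - X @ beta                       # residuals for candidate beta
    med = np.median(r)
    mad = np.median(np.abs(r - med))       # unscaled median absolute deviation
    keep = np.abs(r - med) <= alpha * mad  # trimmed points are dropped
    return float(np.sum(r[keep] ** 2))

# Toy data: a line y = 1 + 2x with small noise and one gross outlier.
x = np.arange(6.0)
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept
noise = np.array([0.1, -0.1, 0.2, -0.2, 0.0, 50.0])
y = 1.0 + 2.0 * x + noise

beta_true = np.array([1.0, 2.0])
ls_value = float(np.sum((y - X @ beta_true) ** 2))  # dominated by the outlier
lst_value = lst_objective(beta_true, X, y)          # outlier is trimmed away
```

With these toy numbers the ordinary sum of squares at the true line is dominated by the single outlier, while the trimmed objective only reflects the five well-behaved residuals.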

preprint 2022 arXiv

Non-asymptotic analysis and inference for an outlyingness induced winsorized mean

Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has surged in the machine learning literature over the last decade. The latest focus is on the sub-Gaussian performance and computability of the estimators in a non-asymptotic setting. Numerous traditional robust estimators are computationally intractable, which partly contributes to the renewed interest in robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median. The latter has the best robustness but suffers from low efficiency. The trimmed mean and the median of means, as robust alternatives to the sample mean that achieve sub-Gaussian performance, have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian estimators of the mean and reveals that none of them can resist more than $25\%$ contamination in the data. It consequently introduces an outlyingness induced winsorized mean, which has the best possible robustness (it can resist up to $50\%$ contamination without breakdown) while achieving high efficiency. Furthermore, it has sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite sample setting. It can be computed in linear time.
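The two estimators named in the abstract, the trimmed mean and the median of means, can be sketched in a few lines. This is a minimal univariate illustration, not the winsorized mean proposed in the paper; the function names, the fixed (non-random) block partition, and the toy data are assumptions for the example.

```python
import numpy as np

def trimmed_mean(x, frac):
    """Mean after dropping the floor(frac * n) smallest and largest values."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(np.floor(frac * x.size))
    return float(x[g:x.size - g].mean())

def median_of_means(x, k):
    """Median of the means of k consecutive blocks (no random shuffling here)."""
    blocks = np.array_split(np.asarray(x, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))

# Eight clean observations at 0 and two gross outliers (20% contamination).
sample = np.array([0.0] * 8 + [1e6] * 2)

plain = float(sample.mean())          # pulled far away by the outliers
trimmed = trimmed_mean(sample, 0.25)  # drops two values from each end
mom = median_of_means(sample, 5)      # only one of five blocks is corrupted
```

In this toy case the sample mean is dragged to a large value, while both robust estimators stay at the clean center. In practice the median of means is usually applied to a random partition of the data.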

preprint 2021 arXiv

Computation of projection regression depth and its induced median

Notions of depth in regression have been introduced and studied in the literature. The most famous example is regression depth (RD), which is a direct extension of location depth to regression. The projection regression depth (PRD) is the extension of another prevailing location depth, the projection depth, to regression. The computation issues of the RD have been discussed in the literature, whereas those of the PRD have never been dealt with before. This article addresses the computation of the PRD and its induced median (maximum depth estimator) in a regression setting. For a given $\boldsymbol{\beta}\in\mathbb{R}^p$, exact algorithms for the PRD with cost $O(n^2\log n)$ ($p=2$) and $O(N(n, p)(p^{3}+n\log n+np^{1.5}+npN_{Iter}))$ ($p>2$), and approximate algorithms for the PRD and its induced median with costs $O(N_{\mathbf{v}}np)$ and $O(Rp N_{\boldsymbol{\beta}}(p^2+nN_{\mathbf{v}}N_{Iter}))$, respectively, are proposed. Here $N(n, p)$ is a number defined based on the total number of $(p-1)$ dimensional hyperplanes formed by points induced from sample points and the $\boldsymbol{\beta}$; $N_{\mathbf{v}}$ is the total number of unit directions $\mathbf{v}$ utilized; $N_{\boldsymbol{\beta}}$ is the total number of candidate regression parameters $\boldsymbol{\beta}$ employed; $N_{Iter}$ is the total number of iterations carried out in an optimization algorithm; $R$ is the total number of replications. Furthermore, as the second major contribution, three PRD induced estimators, which can be computed up to 30 times faster than the PRD induced median while maintaining a similar level of accuracy, are introduced. Examples and simulation studies reveal that the depth median induced from the PRD is favorable in terms of robustness and efficiency compared to the maximum depth estimator induced from the RD, which is the current leading regression median.
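The $O(N_{\mathbf{v}}np)$ cost of the approximate algorithm can be made concrete with a random-direction sketch. The abstract does not define the PRD, so the form used here is an assumption: unfitness along direction $\mathbf{v}$ is taken as $|\mathrm{Med}_i\{(y_i-\mathbf{w}_i'\boldsymbol{\beta})/(\mathbf{w}_i'\mathbf{v})\}|/\mathrm{MAD}(y)$ with $\mathbf{w}_i=(1,\mathbf{x}_i')'$, and the depth is $1/(1+\text{worst unfitness})$. The function name `approx_prd` and its parameters are hypothetical.

```python
import numpy as np

def approx_prd(beta, X, y, n_dirs=500, seed=0):
    """Approximate projection regression depth of beta over random directions.

    Assumed definition (see lead-in): depth = 1 / (1 + max_v UF_v), where
    UF_v = |median((y - W beta) / (W v))| / MAD(y). Assumes MAD(y) > 0.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = np.column_stack([np.ones(n), X])         # rows are w_i = (1, x_i')'
    r = y - W @ beta                             # residuals at beta
    scale = np.median(np.abs(y - np.median(y)))  # MAD of the responses
    V = rng.normal(size=(n_dirs, W.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit directions
    worst = 0.0
    for v in V:                                  # N_v directions
        proj = W @ v                             # O(n p) per direction
        ok = np.abs(proj) > 1e-12                # skip zero denominators
        if ok.any():
            uf = abs(np.median(r[ok] / proj[ok])) / scale
            worst = max(worst, uf)
    return 1.0 / (1.0 + worst)

# Points lying exactly on y = 1 + 2x.
X = np.arange(10.0).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0]
depth_good = approx_prd(np.array([1.0, 2.0]), X, y)   # exact fit
depth_bad = approx_prd(np.array([10.0, -3.0]), X, y)  # poor fit
```

Each direction costs one projection of the $n$ points plus one median, which is where the $N_{\mathbf{v}}np$ factor comes from. Because only finitely many directions are searched, the value returned is an upper bound on the depth under the assumed definition.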

preprint 2020 arXiv

Depth induced regression medians and uniqueness

The notion of median in one dimension is a foundational element in nonparametric statistics. It has been extended to multi-dimensional cases, both in location and in regression, via notions of data depth. Regression depth (RD) and projection regression depth (PRD) represent the two most promising notions in regression. Carrizosa depth $D_C$ is another depth notion in regression. Depth induced regression medians (maximum depth estimators) serve as robust alternatives to the classical least squares estimator. The uniqueness of regression medians is indispensable in the discussion of their properties and the asymptotics (consistency and limiting distribution) of sample regression medians. Are the regression medians induced from RD, PRD, and $D_C$ unique? Answering this question is the main goal of this article. It is found that only the regression median induced from PRD possesses the desired uniqueness property. The conventional remedy for non-uniqueness, taking the average of all medians, might yield an estimator that no longer possesses the maximum depth in both the RD and $D_C$ cases. These and other findings indicate that the PRD and its induced median are highly favorable among their leading competitors.

preprint 2020 arXiv

Exact computation of projection regression depth and fast computation of its induced median and other estimators

Zuo (2019) (Z19) addressed the computation of the projection regression depth (PRD) and its induced median (the maximum depth estimator). Z19 achieved the exact computation of PRD via a modified version of the regular univariate sample median, which resulted in the loss of the invariance of PRD and the equivariance of the depth induced median. This article achieves the exact computation without sacrificing the invariance of PRD and the equivariance of the regression median. Z19 also addressed the approximate computation of the PRD induced median, but the naive algorithm in Z19 is very slow. This article modifies the approximation in Z19 and adopts the Rcpp package, and consequently obtains a much faster algorithm (possibly $100$ times faster) with an even better level of accuracy. Furthermore, as the third major contribution, this article introduces three new depth induced estimators which can run $300$ times faster than the algorithm of Z19 while maintaining the same (or a slightly better) level of accuracy. Real as well as simulated data examples are presented to illustrate the difference between the algorithms of Z19 and the ones proposed in this article. The findings support the statements above and demonstrate the major contributions of the article.

preprint 2020 arXiv

Large sample properties of the regression depth induced median

Notions of depth in regression have been introduced and studied in the literature. Regression depth (RD) of Rousseeuw and Hubert (1999), the most famous one, is a direct extension of Tukey location depth (Tukey (1975)) to regression. Like its location counterpart, the most remarkable advantage of the notion of depth in regression is that it directly introduces the maximum (or deepest) regression depth estimator (aka depth induced median) for regression parameters in a multi-dimensional setting. Classical questions for the regression depth induced median include (i) is it a consistent estimator (or rather, under what sufficient conditions is it consistent)? and (ii) is there any limiting distribution? Bai and He (1999) (BH99) pioneered an attempt to answer these questions. Under some stringent conditions on (i) the design points, (ii) the conditional distributions of $y$ given $\boldsymbol{x}_i$, and (iii) the error distributions, BH99 proved the strong consistency of the depth induced median. Under another set of conditions, BH99 showed the existence of the limiting distribution of the estimator. This article establishes the strong consistency of the depth induced median without any of the stringent conditions in BH99, and proves the existence of the limiting distribution of the estimator under sufficient conditions and by an approach different from those of BH99.
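For readers unfamiliar with regression depth, the simple regression case (one covariate plus intercept) admits a short counting sketch: for each split point, count the residuals that would have to change sign for the line to be tilted into a nonfit, and take the minimum. The brute-force $O(n^2)$ implementation below follows that counting idea as an illustration only; the function name and tolerance `tol` are assumptions, and this is not an algorithm from the article.

```python
import numpy as np

def regression_depth_simple(b0, b1, x, y, tol=1e-12):
    """Regression depth of the line y = b0 + b1 * x in simple regression.

    For every split point v taken from the data, count
      a = #{x_i <= v, r_i >= 0} + #{x_i > v, r_i <= 0}
      b = #{x_i <= v, r_i <= 0} + #{x_i > v, r_i >= 0}
    and return the smallest count seen.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = y - (b0 + b1 * x)
    nonneg = r >= -tol          # zero residuals count on both sides
    nonpos = r <= tol
    depth = x.size
    for v in np.unique(x):
        left = x <= v
        right = ~left
        a = np.sum(left & nonneg) + np.sum(right & nonpos)
        b = np.sum(left & nonpos) + np.sum(right & nonneg)
        depth = min(depth, int(a), int(b))
    return depth

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
on_line = regression_depth_simple(0.0, 1.0, x, x)      # perfect fit
below = regression_depth_simple(-1.0, 1.0, x, x)       # line under all points
zigzag = regression_depth_simple(0.0, 0.0, x,
                                 np.array([1.0, -1.0, 1.0, -1.0, 1.0]))
```

A line passing through every point attains the maximal depth $n$, a line lying strictly below all points has depth 0 (a nonfit), and the horizontal line through alternating residuals sits in between.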

preprint 2020 arXiv

On general notions of depth for regression

Depth notions in location have attracted tremendous attention in the literature. In fact, data depth and its applications have remained one of the most active research topics in statistics over the last two decades. The most favored notions of depth in location include Tukey (1975) halfspace depth (HD), Liu (1990) simplicial depth, and projection depth (Stahel (1981) and Donoho (1982), Liu (1992), Zuo and Serfling (2000) (ZS00) and Zuo (2003)), among others. Depth notions in regression have also been proposed, though only sporadically. Regression depth (RD) of Rousseeuw and Hubert (1999) (RH99) is the most famous one and is a direct extension of Tukey HD to regression. Others include Carrizosa (1996) and the ones induced from Maronna and Yohai (1993) (MY93) proposed in this article. Is there any relationship between Carrizosa depth and RD of RH99? Do these depth notions possess desirable properties? What are the desirable properties? Can existing notions really serve as depth functions in regression? These questions remain open. The four major objectives of the article are: revealing the equivalence between Carrizosa depth and RD of RH99; expanding the location depth evaluation criteria in ZS00 to regression depth notions; examining the existing regression notions with respect to these criteria; and proposing the regression counterpart of the eminent projection depth in location.

preprint 2016 arXiv

Some results on the computing of Tukey's halfspace median

The depth of the Tukey median is investigated for empirical distributions. A sharper upper bound is provided for this value for data sets in general position. This bound is lower than the existing one in the literature and, more importantly, is derived under the practical scenario of a fixed sample size. Several results obtained in this paper are theoretically interesting and also useful in practice for reducing the computational burden of the Tukey median when $p$ is large relative to a large $n$.

preprint 2016 arXiv

The limit of finite sample breakdown point of Tukey's halfspace median for general data

Under special conditions on the data set and the underlying distribution, the limit of the finite sample breakdown point of Tukey's halfspace median ($\frac{1}{3}$) has been obtained in the literature. In this paper, we establish the result under weaker assumptions imposed on the underlying distribution (halfspace symmetry) and on the data set (not necessarily in general position). The representation of Tukey's sample depth regions for data sets not necessarily in general position is also obtained as a by-product of our derivation.

preprint 2011 arXiv

Exactly computing bivariate projection depth contours and median

Among their competitors, projection depth and its induced estimators are very favorable because they enjoy very high breakdown point robustness without having to pay the price of low efficiency, while providing a promising center-outward ordering of multi-dimensional data. However, their further applications have been severely hindered by the computational challenge they pose in practice. In this paper, we derive a simple form of the projection depth function when (μ, σ) = (Med, MAD). This simple form enables us to extend the existing result on point-wise exact computation of projection depth (PD) of Zuo and Lai (2011) to depth contours and the median for bivariate data.
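To show what is being computed, the sketch below approximates the bivariate projection depth with (μ, σ) = (Med, MAD) by scanning a grid of directions, using the standard form PD(x) = 1 / (1 + sup over unit u of |u'x − Med(u'X)| / MAD(u'X)). This is a grid approximation for illustration, not the exact algorithm of the paper; the function name, the grid size `n_angles`, and the toy data are assumptions.

```python
import numpy as np

def projection_depth_2d(point, data, n_angles=360):
    """Approximate projection depth of `point` with respect to bivariate `data`.

    Directions are taken on a half circle, since u and -u give the same
    outlyingness. Assumes MAD(u'X) > 0 in every scanned direction.
    """
    data = np.asarray(data, dtype=float)
    point = np.asarray(point, dtype=float)
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    U = np.column_stack([np.cos(angles), np.sin(angles)])  # unit directions
    proj = data @ U.T                                      # shape (n, n_angles)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    outlyingness = np.abs(point @ U.T - med) / mad
    return float(1.0 / (1.0 + outlyingness.max()))

# A point cloud that is symmetric about the origin.
cloud = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1],
                  [2, 0], [-2, 0], [0, 2], [0, -2]], dtype=float)

depth_center = projection_depth_2d([0.0, 0.0], cloud)
depth_far = projection_depth_2d([10.0, 10.0], cloud)
```

The center of symmetry receives the maximal depth of 1, and the depth decays toward 0 as the point moves away from the cloud. A grid can miss the worst direction, so the returned value can only overestimate the true depth.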

preprint 2010 arXiv

Discussion of "Multivariate quantiles and multiple-output regression quantiles: From $L_1$ optimization to halfspace depth"

Discussion of "Multivariate quantiles and multiple-output regression quantiles: From $L_1$ optimization to halfspace depth" by M. Hallin, D. Paindaveine and M. Siman [arXiv:1002.4486]

preprint 2007 arXiv

On the limiting distributions of multivariate depth-based rank sum statistics and related tests

A depth-based rank sum statistic for multivariate data introduced by Liu and Singh [J. Amer. Statist. Assoc. 88 (1993) 252--260] as an extension of the Wilcoxon rank sum statistic for univariate data has been used in multivariate rank tests in quality control and in experimental studies. Those applications, however, are based on a conjectured limiting distribution, provided by Liu and Singh [J. Amer. Statist. Assoc. 88 (1993) 252--260]. The present paper proves the conjecture under general regularity conditions and, therefore, validates various applications of the rank sum statistic in the literature. The paper also shows that the corresponding rank sum tests can be more powerful than Hotelling's $T^2$ test and some commonly used multivariate rank tests in detecting location-scale changes in multivariate distributions.
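A depth-based rank statistic of this kind can be sketched as follows. The abstract does not write the statistic out, so the form used here is an assumption: each point of the second sample is ranked by the fraction of first-sample points that are no deeper than it, and the ranks are averaged. Mahalanobis depth is used as a simple stand-in depth function; the function names and toy data are hypothetical.

```python
import numpy as np

def mahalanobis_depth(points, ref):
    """Mahalanobis depth 1 / (1 + d^2) of `points` relative to sample `ref`."""
    mu = ref.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(ref, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances
    return 1.0 / (1.0 + d2)

def depth_rank_statistic(x, y):
    """Average over y_j of the fraction of x_i with depth(x_i) <= depth(y_j)."""
    dx = mahalanobis_depth(x, x)
    dy = mahalanobis_depth(y, x)
    ranks = (dx[None, :] <= dy[:, None]).mean(axis=1)
    return float(ranks.mean())

x = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)

y_central = x.mean(axis=0, keepdims=True)  # a point at the center of x
y_far = np.array([[100.0, 100.0]])         # a point far outside x

q_central = depth_rank_statistic(x, y_central)
q_far = depth_rank_statistic(x, y_far)
```

Values near 1 mean the second sample sits deep inside the first, and values near 0 mean it sits on the outskirts, which is the kind of location-scale departure the rank tests in the abstract aim to detect.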
