Researcher profile

Guang Cheng

Guang Cheng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
30works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

30 published item(s)

preprint2026arXiv

From Table to Cell: Attention for Better Reasoning with TABALIGN

Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.

preprint2026arXiv

Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.

preprint2026arXiv

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

Large Language Models exhibit strong reasoning and semantic understanding capabilities but often hallucinate in domains that require expert knowledge, among which fabrications, the generation of factually incorrect yet fluent statements, pose the greatest risk in medical contexts. Existing medical hallucination datasets inadequately capture fabrication phenomena due to limited fabrication coverage, stylistic disparities between human and LLM-authored texts, and distributional drift during hallucinated sample synthesis. To address this, we propose a data-centric pipeline to generate realistic and word-level fabrications that preserve syntactic and stylistic fidelity while introducing subtle factual deviations, resulting in MedFabric. Building upon this dataset, we introduce ETHER, a modular word-level fabrication detector integrating Text2Table Decomposition, Word Masking and Filling and Hybrid Sentence Pair Evaluation to enhance factual alignment. Empirical results demonstrate that MedFabric outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while maintaining consistent performance across structural similarities, offering a comprehensive framework for reliable and domain-specific factuality detection.

preprint2024arXiv

Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models

Devising procedures for downstream task-oriented generative model selections is an unresolved problem of practical importance. Existing studies focused on the utility of a single family of generative models. They provided limited insights on how synthetic data practitioners select the best family generative models for synthetic training tasks given a specific combination of machine learning model class and performance metric. In this paper, we approach the downstream task-oriented generative model selections problem in the case of training fraud detection models and investigate the best practice given different combinations of model interpretability and model performance constraints. Our investigation supports that, while both Neural Network(NN)-based and Bayesian Network(BN)-based generative models are both good to complete synthetic training task under loose model interpretability constrain, the BN-based generative models is better than NN-based when synthetic training fraud detection model under strict model interpretability constrain. Our results provides practical guidance for machine learning practitioner who is interested in replacing their training dataset from real to synthetic, and shed lights on more general downstream task-oriented generative model selection problems.

preprint2024arXiv

Improve Fidelity and Utility of Synthetic Credit Card Transaction Time Series from Data-centric Perspective

Exploring generative model training for synthetic tabular data, specifically in sequential contexts such as credit card transaction data, presents significant challenges. This paper addresses these challenges, focusing on attaining both high fidelity to actual data and optimal utility for machine learning tasks. We introduce five pre-processing schemas to enhance the training of the Conditional Probabilistic Auto-Regressive Model (CPAR), demonstrating incremental improvements in the synthetic data's fidelity and utility. Upon achieving satisfactory fidelity levels, our attention shifts to training fraud detection models tailored for time-series data, evaluating the utility of the synthetic data. Our findings offer valuable insights and practical guidelines for synthetic data practitioners in the finance sector, transitioning from real to synthetic datasets for training purposes, and illuminating broader methodologies for synthesizing credit card transaction time series.

preprint2023arXiv

Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions

Convolutional neural networks have shown impressive abilities in many applications, especially those related to the classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates (up to a log factor). The input dimension only appears in the constant of convergence rates. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.

preprint2023arXiv

Ranking Differential Privacy

Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly develop privacy protection on a single ranking within a set of ranking or pairwise comparisons of a ranking under the $ε$-differential privacy. This paper proposes a novel notion called $ε$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $ε$-ranking differential privacy. This allows us to develop a multistage ranking algorithm to generate synthetic rankings while satisfying the developed $ε$-ranking differential privacy. Theoretical results regarding the utility of synthetic rankings in the downstream tasks, including the inference attack and the personalized ranking tasks, are established. For the inference attack, we quantify how $ε$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.

preprint2022arXiv

Attention Enables Zero Approximation Error

Deep learning models have been widely applied in various aspects of daily life. Many variant models based on deep learning structures have achieved even better performances. Attention-based architectures have become almost ubiquitous in deep learning structures. Especially, the transformer model has now defeated the convolutional neural network in image classification tasks to become the most widely used tool. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, the single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error. The number of transformer encoder blocks is the same as the degree of the target polynomial. Even more exciting, we find that these transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal. These surprising theoretical results clearly explain the outstanding performances of the transformer model and may shed light on future modifications in real applications. We also provide some experiments to verify our theoretical result.

preprint2022arXiv

Benefit of Interpolation in Nearest Neighbor Algorithms

In some studies \citep[e.g.,][]{zhang2016understanding} of deep learning, it is observed that over-parametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works towards understanding this so-called "double descent" phenomenon \citep[e.g.,][]{belkin2018reconciling,belkin2019two}, in this paper, we turn into another way to enforce zero training error (without over-parametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result \citep{belkin2018does} that zero training error does not necessarily jeopardize predictive performances and claims a counter-intuitive result that a mild degree of data interpolation actually {\em strictly} improve the prediction performance and statistical stability over those of the (un-interpolated) $k$-NN algorithm. In the end, the universality of our results, such as change of distance measure and corrupted testing data, will also be discussed.

preprint2022arXiv

Enhanced Nearest Neighbor Classification for Crowdsourcing

In machine learning, crowdsourcing is an economical way to label a large amount of data. However, the noise in the produced labels may deteriorate the accuracy of any classification method applied to the labelled data. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue. Two algorithms are developed to estimate the worker quality (which is often unknown in practice): one is to construct the estimate based on the denoised worker labels by applying the $k$NN classifier to the expert data; the other is an iterative algorithm that works even without access to the expert data. Other than strong numerical evidence, our proposed methods are proven to achieve the same regret as its oracle version based on high-quality expert data. As a technical by-product, a lower bound on the sample size assigned to each worker to reach the optimal convergence rate of regret is derived.

preprint2022arXiv

Fair Bayes-Optimal Classifiers Under Predictive Parity

Increasing concerns about disparate effects of AI have motivated a great deal of work on fair machine learning. Existing works mainly focus on independence- and separation-based measures (e.g., demographic parity, equality of opportunity, equalized odds), while sufficiency-based measures such as predictive parity are much less studied. This paper considers predictive parity, which requires equalizing the probability of success given a positive prediction among different protected groups. We prove that, if the overall performances of different groups vary only moderately, all fair Bayes-optimal classifiers under predictive parity are group-wise thresholding rules. Perhaps surprisingly, this may not hold if group performance levels vary widely; in this case we find that predictive parity among protected groups may lead to within-group unfairness. We then propose an algorithm we call FairBayes-DPP, aiming to ensure predictive parity when our condition is satisfied. FairBayes-DPP is an adaptive thresholding algorithm that aims to achieve predictive parity, while also seeking to maximize test accuracy. We provide supporting experiments conducted on synthetic and empirical data.

preprint2022arXiv

Federated Online Sparse Decision Making

This paper presents a novel federated linear contextual bandits model, where individual clients face different K-armed stochastic bandits with high-dimensional decision context and coupled through common global parameters. By leveraging the sparsity structure of the linear reward , a collaborative algorithm named \texttt{Fedego Lasso} is proposed to cope with the heterogeneity across clients without exchanging local decision context vectors or raw reward data. \texttt{Fedego Lasso} relies on a novel multi-client teamwork-selfish bandit policy design, and achieves near-optimal regrets for shared parameter cases with logarithmic communication costs. In addition, a new conceptual tool called federated-egocentric policies is introduced to delineate exploration-exploitation trade-off. Experiments demonstrate the effectiveness of the proposed algorithms on both synthetic and real-world datasets.

preprint2022arXiv

Minimax Optimal Deep Neural Network Classifiers Under Smooth Decision Boundary

Deep learning has gained huge empirical successes in large-scale classification problems. In contrast, there is a lack of statistical understanding about deep learning methods, particularly in the minimax optimality perspective. For instance, in the classical smooth decision boundary setting, existing deep neural network (DNN) approaches are rate-suboptimal, and it remains elusive how to construct minimax optimal DNN classifiers. Moreover, it is interesting to explore whether DNN classifiers can circumvent the curse of dimensionality in handling high-dimensional data. The contributions of this paper are two-fold. First, based on a localized margin framework, we discover the source of suboptimality of existing DNN approaches. Motivated by this, we propose a new deep learning classifier using a divide-and-conquer technique: DNN classifiers are constructed on each local region and then aggregated to a global one. We further propose a localized version of the classical Tsybakov's noise condition, under which statistical optimality of our new classifier is established. Second, we show that DNN classifiers can adapt to low-dimensional data structures and circumvent the curse of dimensionality in the sense that the minimax rate only depends on the effective dimension, potentially much smaller than the actual data dimension. Numerical experiments are conducted on simulated data to corroborate our theoretical results.

preprint2022arXiv

Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning

The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.

preprint2022arXiv

Optimal False Discovery Control of Minimax Estimator

Two major research tasks lie at the heart of high dimensional data analysis: accurate parameter estimation and correct support recovery. The existing literature mostly aims for either the best parameter estimation or the best model selection result, however little has been done to understand the potential interaction between the estimation precision and the selection behavior. In this work, our minimax result shows that an estimator's performance of type I error control directly links with its $L_2$ estimation error rate, and reveals a trade-off phenomenon between the rate of convergence and the false discovery control: to achieve better accuracy, one risks yielding more false discoveries. In particular, we characterize the false discovery control behavior of rate optimal and rate suboptimal estimators under different sparsity regimes, and discover a rigid dichotomy between these two estimators under near-linear and linear sparsity settings. In addition, this work provides a rigorous explanation to the incompatibility phenomenon between selection consistency and rate minimaxity which has been frequently observed in the high dimensional literature.

preprint2022arXiv

Residual Bootstrap Exploration for Stochastic Linear Bandit

We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.

preprint2022arXiv

Unlabeled Data Help: Minimax Analysis and Adversarial Robustness

The recent proposed self-supervised learning (SSL) approaches successfully demonstrate the great potential of supplementing learning algorithms with additional unlabeled data. However, it is still unclear whether the existing SSL algorithms can fully utilize the information of both labelled and unlabeled data. This paper gives an affirmative answer for the reconstruction-based SSL algorithm \citep{lee2020predicting} under several statistical models. While existing literature only focuses on establishing the upper bound of the convergence rate, we provide a rigorous minimax analysis, and successfully justify the rate-optimality of the reconstruction-based SSL algorithm under different data generation models. Furthermore, we incorporate the reconstruction-based SSL into the existing adversarial training algorithms and show that learning from unlabeled data helps improve the robustness.

preprint2021arXiv

Stein Neural Sampler

We propose two novel samplers to generate high-quality samples from a given (un-normalized) probability density. Motivated by the success of generative adversarial networks, we construct our samplers using deep neural networks that transform a reference distribution to the target distribution. Training schemes are developed to minimize two variations of the Stein discrepancy, which is designed to work with un-normalized densities. Once trained, our samplers are able to generate samples instantaneously. We show that the proposed methods are theoretically sound and experience fewer convergence issues compared with traditional sampling approaches according to our empirical studies.

preprint2020arXiv

Adaptive Variational Bayesian Inference for Sparse Deep Neural Network

In this work, we focus on variational Bayesian inference on the sparse Deep Neural Network (DNN) modeled under a class of spike-and-slab priors. Given a pre-specified sparse DNN structure, the corresponding variational posterior contraction rate is characterized that reveals a trade-off between the variational error and the approximation error, which are both determined by the network structural complexity (i.e., depth, width and sparsity). However, the optimal network structure, which strikes the balance of the aforementioned trade-off and yields the best rate, is generally unknown in reality. Therefore, our work further develops an {\em adaptive} variational inference procedure that can automatically select a reasonably good (data-dependent) network structure that achieves the best contraction rate, without knowing the optimal network structure. In particular, when the true function is H{ö}lder smooth, the adaptive variational inference is capable to attain (near-)optimal rate without the knowledge of smoothness level. The above rate still suffers from the curse of dimensionality, and thus motivates the teacher-student setup, i.e., the true function is a sparse DNN model, under which the rate only logarithmically depends on the input dimension.

preprint2020arXiv

Building an Efficient Intrusion Detection System Based on Feature Selection and Ensemble Classifier

Intrusion detection system (IDS) is one of extensively used techniques in a network topology to safeguard the integrity and availability of sensitive assets in the protected systems. Although many supervised and unsupervised learning approaches from the field of machine learning have been used to increase the efficacy of IDSs, it is still a problem for existing intrusion detection algorithms to achieve good performance. First, lots of redundant and irrelevant data in high-dimensional datasets interfere with the classification process of an IDS. Second, an individual classifier may not perform well in the detection of each type of attacks. Third, many models are built for stale datasets, making them less adaptable for novel attacks. Thus, we propose a new intrusion detection framework in this paper, and this framework is based on the feature selection and ensemble learning techniques. In the first step, a heuristic algorithm called CFS-BA is proposed for dimensionality reduction, which selects the optimal subset based on the correlation between features. Then, we introduce an ensemble approach that combines C4.5, Random Forest (RF), and Forest by Penalizing Attributes (Forest PA) algorithms. Finally, voting technique is used to combine the probability distributions of the base learners for attack recognition. The experimental results, using NSL-KDD, AWID, and CIC-IDS2017 datasets, reveal that the proposed CFS-BA-Ensemble method is able to exhibit better performance than other related and state of the art approaches under several metrics.

preprint2020arXiv

Finite Time Analysis of Vector Autoregressive Models under Linear Restrictions

This paper develops a unified finite-time theory for the ordinary least squares estimation of possibly unstable and even slightly explosive vector autoregressive models under linear restrictions, with the applicable region $ρ(A)\leq 1+c/n$, where $ρ(A)$ is the spectral radius of the transition matrix $A$ in the \VAR(1) representation, $n$ is the time horizon and $c>0$ is a universal constant. The linear restriction framework encompasses various existing models such as banded/network vector autoregressive models. We show that the restrictions reduce the error bounds via not only the reduced dimensionality but also a scale factor resembling the asymptotic covariance matrix of the estimator in the fixed-dimensional setup: as long as the model is correctly specified, this scale factor is decreasing in the number of restrictions. It is revealed that the phase transition from slow to fast error rate regimes is determined by the smallest singular value of $A$, a measure of the least excitable mode of the system. The minimax lower bounds are derived across different regimes. The developed non-asymptotic theory not only bridges the theoretical gap between stable and unstable regimes but precisely characterizes the effect of restrictions and its interplay with model parameters. Simulations support our theoretical results.

preprint2020arXiv

Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares

Statistical inferences for quadratic functionals of linear regression parameter have found wide applications including signal detection, global testing, inferences of error variance and fraction of variance explained. Classical theory based on ordinary least squares estimator works perfectly in the low-dimensional regime, but fails when the parameter dimension $p_n$ grows proportionally to the sample size $n$. In some cases, its performance is not satisfactory even when $n\ge 5p_n$. The main contribution of this paper is to develop {\em dimension-adaptive} inferences for quadratic functionals when $\lim_{n\to \infty} p_n/n=τ\in[0,1)$. We propose a bias-and-variance-corrected test statistic and demonstrate that its theoretical validity (such as consistency and asymptotic normality) is adaptive to both low dimension with $τ= 0$ and moderate dimension with $τ\in(0, 1)$. Our general theory holds, in particular, without Gaussian design/error or structural parameter assumption, and applies to a broad class of quadratic functionals covering all aforementioned applications. As a by-product, we find that the classical fixed-dimensional results continue to hold {\em if and only if} the signal-to-noise ratio is large enough, say when $p_n$ diverges but slower than $n$. Extensive numerical results demonstrate the satisfactory performance of the proposed methodology even when $p_n\ge 0.9n$ in some extreme cases. The mathematical arguments are based on the random matrix theory and leave-one-observation-out method.

preprint2020arXiv

On Deep Instrumental Variables Estimate

The endogeneity issue is fundamentally important as many empirical applications may suffer from the omission of explanatory variables, measurement error, or simultaneous causality. Recently, \cite{hllt17} propose a "Deep Instrumental Variable (IV)" framework based on deep neural networks to address endogeneity, demonstrating superior performances than existing approaches. The aim of this paper is to theoretically understand the empirical success of the Deep IV. Specifically, we consider a two-stage estimator using deep neural networks in the linear instrumental variables model. By imposing a latent structural assumption on the reduced form equation between endogenous variables and instrumental variables, the first-stage estimator can automatically capture this latent structure and converge to the optimal instruments at the minimax optimal rate, which is free of the dimension of instrumental variables and thus mitigates the curse of dimensionality. Additionally, in comparison with classical methods, due to the faster convergence rate of the first-stage estimator, the second-stage estimator has {a smaller (second order) estimation error} and requires a weaker condition on the smoothness of the optimal instruments. Given that the depth and width of the employed deep neural network are well chosen, we further show that the second-stage estimator achieves the semiparametric efficiency bound. Simulation studies on synthetic data and application to automobile market data confirm our theory.

preprint2020arXiv

Online Batch Decision-Making with High-Dimensional Covariates

We propose and investigate a class of new algorithms for sequential decision making that interacts with \textit{a batch of users} simultaneously instead of \textit{a user} at each decision epoch. This type of batch models is motivated by interactive marketing and clinical trial, where a group of people are treated simultaneously and the outcomes of the whole group are collected before the next stage of decision. In such a scenario, our goal is to allocate a batch of treatments to maximize treatment efficacy based on observed high-dimensional user covariates. We deliver a solution, named \textit{Teamwork LASSO Bandit algorithm}, that resolves a batch version of explore-exploit dilemma via switching between teamwork stage and selfish stage during the whole decision process. This is made possible based on statistical properties of LASSO estimate of treatment efficacy that adapts to a sequence of batch observations. In general, a rate of optimal allocation condition is proposed to delineate the exploration and exploitation trade-off on the data collection scheme, which is sufficient for LASSO to identify the optimal treatment for observed user covariates. An upper bound on expected cumulative regret of the proposed algorithm is provided.

preprint2020arXiv

Predictive Power of Nearest Neighbors Algorithm under Random Perturbation

We consider a data corruption scenario in the classical $k$ Nearest Neighbors ($k$-NN) algorithm, that is, the testing data are randomly perturbed. Under such a scenario, the impact of corruption level on the asymptotic regret is carefully characterized. In particular, our theoretical analysis reveals a phase transition phenomenon that, when the corruption level $ω$ is below a critical order (i.e., small-$ω$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., large-$ω$ regime), the asymptotic regret deteriorates polynomially. Surprisingly, we obtain a negative result that the classical noise-injection approach will not help improve the testing performance in the beginning stage of the large-$ω$ regime, even in the level of the multiplicative constant of asymptotic regret. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in \cite{xue2017achieving} will at most achieve a sub-optimal rate when the data dimension $d>4$ even if $k$ is chosen optimally in the pre-processing step.

preprint2020arXiv

Residual Bootstrap Exploration for Bandit Algorithms

In this paper, we propose a novel perturbation-based exploration method in bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (\texttt{ReBoot}). The \texttt{ReBoot} enforces exploration by injecting data-driven randomness through a residual-based perturbation mechanism. This novel mechanism captures the underlying distributional properties of fitting errors, and more importantly boosts exploration to escape from suboptimal solutions (for small sample sizes) by inflating variance level in an \textit{unconventional} way. In theory, with appropriate variance inflation level, \texttt{ReBoot} provably secures instance-dependent logarithmic regret in Gaussian multi-armed bandits. We evaluate the \texttt{ReBoot} in different synthetic multi-armed bandits problems and observe that the \texttt{ReBoot} performs better for unbounded rewards and more robustly than \texttt{Giro} \cite{kveton2018garbage} and \texttt{PHE} \cite{kveton2019perturbed}, with comparable computational efficiency to the Thompson sampling method.

preprint2020arXiv

Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting

Classifiers built with neural networks handle large-scale high dimensional data, such as facial images from computer vision, extremely well while traditional statistical methods often fail miserably. In this paper, we attempt to understand this empirical success in high dimensional classification by deriving the convergence rates of excess risk. In particular, a teacher-student framework is proposed that assumes the Bayes classifier to be expressed as ReLU neural networks. In this setup, we obtain a sharp rate of convergence, i.e., $\tilde{O}_d(n^{-2/3})$, for classifiers trained using either 0-1 loss or hinge loss. This rate can be further improved to $\tilde{O}_d(n^{-1})$ when the data distribution is separable. Here, $n$ denotes the sample size. An interesting observation is that the data dimension only contributes to the $\log(n)$ term in the above rates. This may provide one theoretical explanation for the empirical successes of deep neural networks in high dimensional classification, particularly for structured data.

preprint2020arXiv

Simultaneous Inference for Massive Data: Distributed Bootstrap

In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without over-resampling, typically required by existing methods \cite{kleiner2014scalable,sengupta2016subsampled}, while provably achieving optimal statistical efficiency with minimal communication. Our method does not require repeatedly re-fitting the model but only applies multiplier bootstrap in the master machine on the gradients received from the worker machines. Simulations validate our theory.

preprint2020arXiv

Sparse and Low-rank Tensor Estimation via Cubic Sketchings

In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.

preprint2020arXiv

Sparse Confidence Sets for Normal Mean Models

In this paper, we propose a new framework to construct confidence sets for a $d$-dimensional unknown sparse parameter $θ$ under the normal mean model $X\sim N(θ,σ^2I)$. A key feature of the proposed confidence set is its capability to account for the sparsity of $θ$, thus named as {\em sparse} confidence set. This is in sharp contrast with the classical methods, such as Bonferroni confidence intervals and other resampling based procedures, where the sparsity of $θ$ is often ignored. Specifically, we require the desired sparse confidence set to satisfy the following two conditions: (i) uniformly over the parameter space, the coverage probability for $θ$ is above a pre-specified level; (ii) there exists a random subset $S$ of $\{1,...,d\}$ such that $S$ guarantees the pre-specified true negative rate (TNR) for detecting nonzero $θ_j$'s. To exploit the sparsity of $θ$, we define that the confidence interval for $θ_j$ degenerates to a single point 0 for any $j\notin S$. Under this new framework, we first consider whether there exist sparse confidence sets that satisfy the above two conditions. To address this question, we establish a non-asymptotic minimax lower bound for the non-coverage probability over a suitable class of sparse confidence sets. The lower bound deciphers the role of sparsity and minimum signal-to-noise ratio (SNR) in the construction of sparse confidence sets. Furthermore, under suitable conditions on the SNR, a two-stage procedure is proposed to construct a sparse confidence set. To evaluate the optimality, the proposed sparse confidence set is shown to attain a minimax lower bound of some properly defined risk function up to a constant factor. Finally, we develop an adaptive procedure to the unknown sparsity and SNR. Numerical studies are conducted to verify the theoretical results.