Source author record

Yun Yang

Yun Yang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

45 works

23 topics

4 close collaborators

preprint · 2026 · arXiv

Accumulation of Sub-Sampling Matrices with Applications to Statistical Computation

With appropriately chosen sampling probabilities, sampling-based random projection can be used to implement large-scale statistical methods, substantially reducing computational cost while maintaining low statistical error. However, computing optimal sampling probabilities is often itself expensive, and in practice one typically resorts to suboptimal schemes. This generally leads to increased time and space costs, as more subsamples are required and the resulting projection matrices become larger, thereby making the inference procedure more computationally demanding. In this paper, we extend the framework of sampling-based random projection and propose a new projection method, accumulative sub-sampling. By carefully accumulating multiple such projections, accumulative sub-sampling improves statistical efficiency while controlling the effective matrix size throughout the statistical computation. On the theoretical side, we quantify how the quality of the subsampling scheme affects the error in approximating matrix products and positive semidefinite matrices, and show how the proposed accumulation strategy mitigates this effect. Moreover, we apply our method to statistical models involving intensive matrix operations, such as eigendecomposition in spectral clustering and matrix inversion in kernel ridge regression, and demonstrate that reducing the effective matrix size leads to substantial computational savings. Numerical experiments across a range of problems further show that our approach consistently improves computational efficiency compared to existing random projection baselines under suboptimal sampling schemes.

preprint · 2026 · arXiv

Toward Scalable Terminal Task Synthesis via Skill Graphs

Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built on a scenario-mediated skill graph. SkillSynth first constructs a large-scale skill graph, where scenarios serve as intermediate transition nodes that connect diverse command-line skills. It then samples paths from this graph as abstractions of real-world workflows, and uses a multi-agent harness to instantiate them into executable task instances. By grounding task synthesis in graph-sampled workflow paths, SkillSynth explicitly controls the diversity of minimal execution trajectories required to solve the synthesized tasks. Experiments on Terminal-Bench demonstrate the effectiveness of SkillSynth. Moreover, task instances synthesized by SkillSynth have been adopted to train Hy3 Preview, contributing to its enhanced agentic capabilities in terminal-based settings.

preprint · 2024 · arXiv

A Practical Beamforming Design for Active RIS-assisted MU-MISO Systems

Reconfigurable Intelligent Surfaces (RIS) have been proposed as a revolutionary technology with the potential to address several critical requirements of 6G communication systems. Despite their powerful ability to reconfigure the radio environment, the "double fading" effect limits practical performance gains because of the significant path loss. A new active RIS architecture has recently been proposed to overcome this challenge. However, existing active RIS studies rely on an ideal amplification model that ignores the practical hardware limitations of amplifiers, and such inaccurate modeling may degrade performance. Motivated by this fact, in this paper we first investigate the amplification principle of a typical active RIS and propose a more accurate amplification model based on amplifier hardware characteristics. Then, based on the new amplification model, we propose a novel joint transmit beamforming and RIS reflection beamforming design that considers the incident signal power on a practical active RIS for multiuser multi-input single-output (MU-MISO) communication systems. Fractional programming (FP), majorization minimization (MM) and block coordinate descent (BCD) methods are used to solve the resulting complex problem. Simulation results indicate the importance of considering practical amplifier hardware characteristics in joint beamforming designs and demonstrate the effectiveness of the proposed algorithm compared to other benchmarks.

preprint · 2023 · arXiv

Estimating Distributions with Low-dimensional Structures Using Mixtures of Generative Models

There has been a growing interest in statistical inference from data satisfying the so-called manifold hypothesis, assuming data points in the high-dimensional ambient space to lie in close vicinity of a submanifold of much lower dimension. In machine learning, encoder-decoder pair based generative modelling approaches have been successful in learning complicated high-dimensional distributions such as those over images and texts by explicitly imposing the low-dimensional manifold structure. In this work, we introduce a new approach for estimating distributions on unknown submanifolds via mixtures of generative models. We show that conventional generative modeling approaches using a single encoder-decoder pair are generally unable to capture data distributions under the manifold hypothesis, unless the underlying manifold admits a global parametrization; however, this issue can be solved by using a collection of encoder-decoder pairs for learning different local patches of the data supporting manifold. A rigorous theoretical analysis is developed to demonstrate that the proposed estimator attains the minimax-optimal rate of convergence for the implicit estimation of data distributions with manifold structures. Our experiments show that, by utilizing parameter sharing, the proposed method can significantly improve the performance of conventional auto-encoder based generative modelling approaches with minimal additional computational efforts.

preprint · 2022 · arXiv

Cost-effective Land Cover Classification for Remote Sensing Images

Land cover maps, which are mainly generated by remote sensing image classification techniques, are of vital importance to various fields such as land use policy development, ecosystem services, urban planning and agriculture monitoring. Traditional land cover classification usually needs tremendous computational resources, which often becomes a huge burden to the remote sensing community. Cloud computing is undoubtedly one of the best choices for land cover classification; however, if not managed properly, the computation cost on the cloud can be surprisingly high. Recently, cutting the unnecessary computation long tail has become a promising solution for saving cost in the cloud. For land cover classification, it is generally not necessary to achieve the best accuracy, and an accuracy of 85% can be regarded as reliable. Therefore, in this paper, we propose a framework for cost-effective remote sensing classification. Given the desired accuracy, the clustering algorithm can stop early to save cost whilst achieving sufficient accuracy for land cover image classification. Experimental results show that achieving 85%-99.9% accuracy needs only 27.34%-60.83% of the total cloud computation cost of achieving 100% accuracy. To put it into perspective, for the US land cover classification example, the proposed approach can save over $1,593,490.18 for the government in each single use when the desired accuracy is 90%.

preprint · 2022 · arXiv

High-Dimensional Linear Regression via Implicit Regularization

Many statistical estimators for high-dimensional linear regression are M-estimators, formed through minimizing a data-dependent square loss function plus a regularizer. This work considers a new class of estimators implicitly defined through a discretized gradient dynamic system under overparameterization. We show that under suitable restricted isometry conditions, overparameterization leads to implicit regularization: if we directly apply gradient descent to the residual sum of squares with sufficiently small initial values, then under some proper early stopping rule, the iterates converge to a nearly sparse rate-optimal solution that improves over explicitly regularized approaches. In particular, the resulting estimator does not suffer from extra bias due to explicit penalties, and can achieve the parametric root-n rate when the signal-to-noise ratio is sufficiently high. We also perform simulations to compare our methods with high dimensional linear regression with explicit regularization. Our results illustrate the advantages of using implicit regularization via gradient descent after overparameterization in sparse vector estimation.
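The idea of gradient descent from a small initialization with early stopping can be illustrated with a minimal sketch. It assumes the overparameterization `beta = u*u - v*v`, a common choice in the implicit-regularization literature that the abstract does not confirm as the paper's own. The function name, the toy design and the values of `alpha`, `step` and `n_iter` are all hypothetical.

```python
def implicit_reg_gd(X, y, alpha=1e-3, step=0.1, n_iter=100):
    """Gradient descent on (1/(2n))||y - X beta||^2 with beta = u*u - v*v.

    The parametrization and all default values are illustrative assumptions.
    """
    n, p = len(X), len(X[0])
    u = [alpha] * p  # small initial values, so beta starts at zero
    v = [alpha] * p
    for _ in range(n_iter):  # n_iter plays the role of the early stopping time
        beta = [u[j] * u[j] - v[j] * v[j] for j in range(p)]
        # residual r = y - X beta
        r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # g = X^T r / n is minus the gradient of the loss with respect to beta
        g = [sum(X[i][j] * r[i] for i in range(n)) / n for j in range(p)]
        # chain rule: dL/du = -2 u g and dL/dv = +2 v g
        u = [u[j] * (1.0 + 2.0 * step * g[j]) for j in range(p)]
        v = [v[j] * (1.0 - 2.0 * step * g[j]) for j in range(p)]
    return [u[j] * u[j] - v[j] * v[j] for j in range(p)]


# Toy design with X^T X / n = I, one true signal and small deterministic noise.
X = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [2.1, 0.1, -0.1, 0.1]  # X @ [1, 0, 0, 0] plus noise [0.1, 0.1, -0.1, 0.1]
beta_hat = implicit_reg_gd(X, y)
```

In this toy run the signal coordinate reaches its least-squares value while the noise-only coordinates stay close to zero. Running for many more iterations would let those coordinates drift toward the noisy least-squares fit, which is why the stopping time matters.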

preprint · 2022 · arXiv

Hypernetwork Dismantling via Deep Reinforcement Learning

Network dismantling aims to degrade the connectivity of a network by removing an optimal set of nodes. It has been widely adopted in many real-world applications such as epidemic control and rumor containment. However, conventional methods usually focus on simple network modeling with only pairwise interactions, while group-wise interactions modeled by hypernetwork are ubiquitous and critical. In this work, we formulate the hypernetwork dismantling problem as a node sequence decision problem and propose a deep reinforcement learning (DRL)-based hypernetwork dismantling framework. Besides, we design a novel inductive hypernetwork embedding method to ensure the transferability to various real-world hypernetworks. Our framework first generates small-scale synthetic hypernetworks and embeds the nodes and hypernetworks into a low dimensional vector space to represent the action and state space in DRL, respectively. Then trial-and-error dismantling tasks are conducted by an agent on these synthetic hypernetworks, and the dismantling strategy is continuously optimized. Finally, the well-optimized strategy is applied to real-world hypernetwork dismantling tasks. Experimental results on five real-world hypernetworks demonstrate the effectiveness of our proposed framework.

preprint · 2022 · arXiv

Learning Topic Models: Identifiability and Finite-Sample Analysis

Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets.

preprint · 2022 · arXiv

Mean-Field Nonparametric Estimation of Interacting Particle Systems

This paper concerns the nonparametric estimation problem of the distribution-state dependent drift vector field in an interacting $N$-particle system. Observing single-trajectory data for each particle, we derive the mean-field rate of convergence for the maximum likelihood estimator (MLE), which depends on both Gaussian complexity and Rademacher complexity of the function class. In particular, when the function class contains $α$-smooth Hölder functions, our rate of convergence is minimax optimal on the order of $N^{-\frac{α}{d+2α}}$. Combining with a Fourier analytical deconvolution argument, we derive the consistency of MLE for the external force and interaction kernel in the McKean-Vlasov equation.

preprint · 2022 · arXiv

Minimax Rate of Distribution Estimation on Unknown Submanifold under Adversarial Losses

Statistical inference from high-dimensional data with low-dimensional structures has recently attracted lots of attention. In machine learning, deep generative modeling approaches implicitly estimate distributions of complex objects by creating new samples from the underlying distribution, and have achieved great success in generating synthetic realistic-looking images and texts. A key step in these approaches is the extraction of latent features or representations (encoding) that can be used for accurately reconstructing the original data (decoding). In other words, low-dimensional manifold structure is implicitly assumed and utilized in the distribution modeling and estimation. To understand the benefit of low-dimensional manifold structure in generative modeling, we build a general minimax framework for distribution estimation on unknown submanifold under adversarial losses, with suitable smoothness assumptions on the target distribution and the manifold. The established minimax rate elucidates how various problem characteristics, including intrinsic dimensionality of the data and smoothness levels of the target distribution and the manifold, affect the fundamental limit of high-dimensional distribution estimation. To prove the minimax upper bound, we construct an estimator based on a mixture of locally fitted generative models, which is motivated by the partition of unity technique from differential geometry and is necessary to cover cases where the underlying data manifold does not admit a global parametrization. We also propose a data-driven adaptive estimator that is shown to simultaneously attain within a logarithmic factor of the optimal rate over a large collection of distribution classes.

preprint · 2022 · arXiv

N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks

We present a novel mesh-based learning approach (N-Cloth) for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space to reduce the non-linearity in the mesh space. Our network can predict the target 3D cloth mesh deformation based on the initial state of the cloth mesh template and the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles and scenes with various objects corresponding to SMPL humans, non-SMPL humans or rigid bodies. In practice, our approach can be used to generate plausible cloth simulation at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators.

preprint · 2022 · arXiv

Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering

Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original $K$-means SDP with substantially reduced runtime.
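The two-stage structure described above (cluster a subsample, then propagate to all points by nearest-centroid rounding) can be sketched as follows. This is only an illustration: the SDP step is replaced by plain Lloyd iterations with a farthest-point initialization, and the names `lloyd` and `sketch_and_lift` and the parameters `m` and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))


def lloyd(points, k, n_iter=20):
    """Plain Lloyd iterations, used here as a stand-in for the SDP solver."""
    # Farthest-point initialization: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the group mean; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def sketch_and_lift(points, k, m, seed=0):
    """Cluster a random subsample of size m, then label all points by nearest centroid."""
    rng = random.Random(seed)
    subsample = rng.sample(points, m)      # sketch: work on m points only
    centroids = lloyd(subsample, k)
    return [nearest(p, centroids) for p in points]  # lift: one pass over all points


# Two well-separated groups of 50 points each.
data = [(0.01 * i, 0.0) for i in range(50)] + [(10.0 + 0.01 * i, 0.0) for i in range(50)]
labels = sketch_and_lift(data, k=2, m=20)
```

The expensive step only ever sees `m` points, and the lifting step is a single pass over the data, which is where the linear scaling in the number of points comes from in this simplified version.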

preprint · 2021 · arXiv

Class Knowledge Overlay to Visual Feature Learning for Zero-Shot Image Classification

New categories can be discovered by transforming semantic features into synthesized visual features without corresponding training samples in zero-shot image classification. Although significant progress has been made in generating high-quality synthesized visual features using generative adversarial networks, guaranteeing semantic consistency between the semantic features and visual features remains very challenging. In this paper, we propose a novel zero-shot learning approach, GAN-CST, based on class-knowledge-to-visual feature learning to tackle the problem. The approach consists of three parts: class knowledge overlay, semi-supervised learning and triplet loss. It applies class knowledge overlay (CKO) to obtain knowledge not only from the corresponding class but also from other classes that have the knowledge overlay. This ensures that the knowledge-to-visual learning process has adequate information to generate synthesized visual features. The approach also applies a semi-supervised learning process to re-train the knowledge-to-visual model, which reinforces synthesized visual feature generation as well as new category prediction. We tabulate results on a number of benchmark datasets demonstrating that the proposed model delivers superior performance over state-of-the-art approaches.

preprint · 2021 · arXiv

Cross Knowledge-based Generative Zero-Shot Learning Approach with Taxonomy Regularization

Although zero-shot learning (ZSL) has an inferential capability of recognizing new classes that have never been seen before, it always faces two fundamental challenges: the cross-modality and cross-domain challenges. In order to alleviate these problems, we develop a generative network-based ZSL approach equipped with the proposed Cross Knowledge Learning (CKL) scheme and Taxonomy Regularization (TR). In our approach, the semantic features are taken as inputs, and the output is the synthesized visual features generated from the corresponding semantic features. CKL enables more relevant semantic features to be trained for semantic-to-visual feature embedding in ZSL, while Taxonomy Regularization (TR) significantly improves the intersections with unseen images, with more generalized visual features generated from the generative network. Extensive experiments on several benchmark datasets (i.e., AwA1, AwA2, CUB, NAB and aPY) show that our approach is superior to state-of-the-art methods in terms of ZSL image classification and retrieval.

preprint · 2021 · arXiv

Distributed Estimation for Principal Component Analysis: an Enlarged Eigenspace Analysis

The growing size of modern data sets brings many challenges to the existing statistical estimation approaches, which calls for new distributed methodologies. This paper studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less is presented for the top-$L$-dim ($L>1$) eigenspace estimation, especially in a distributed manner. We propose a novel multi-round algorithm for constructing top-$L$-dim eigenspace for distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. Our estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach has no restriction on the number of machines. Theoretically, the traditional Davis-Kahan theorem requires the explicit eigengap assumption to estimate the top-$L$-dim eigenspace. To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.

preprint · 2021 · arXiv

EdgeWorkflowReal: An Edge Computing based Workflow Execution Engine for Smart Systems

Current cloud-based smart systems suffer from weaknesses such as high response latency, limited network bandwidth and the restricted computing power of smart end devices which seriously affect the system's QoS (Quality of Service). Recently, given its advantages of low latency, high bandwidth and location awareness, edge computing has become a promising solution for smart systems. However, the development of edge computing based smart systems is a very challenging job for software developers who do not have the skills for the creation of edge computing environments. The management of edge computing resources and computing tasks is also very challenging. Workflow technology has been widely used in smart systems to automate task and resource management, but there does not yet exist a real-world deployable edge computing based workflow execution engine. To fill this gap, we present EdgeWorkflowReal, an edge computing based workflow execution engine for smart systems. EdgeWorkflowReal supports: 1) automatic creation of a real edge computing environment according to user settings; 2) visualized modelling of edge workflow applications; and 3) automatic deployment, monitoring and performance evaluation of edge workflow applications in a smart system.

preprint · 2021 · arXiv

Fast Statistical Leverage Score Approximation in Kernel Ridge Regression

Nyström approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems through sub-sampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a sub-sampling method heavily relies on correctly estimating the statistical leverage scores for forming the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear time (modulo poly-log terms) algorithm to accurately approximate the statistical leverage scores in the stationary-kernel-based KRR with theoretical guarantees. Particularly, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, which depends on both the input distribution and the spectral density of stationary kernels, for capturing the non-uniformity of the statistical leverage scores. Numerical experiments demonstrate that with the same prediction accuracy our method is orders of magnitude more efficient than existing methods in selecting the representative sub-samples in the Nyström approximation.
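For context, the Nyström idea itself can be sketched in a few lines: restrict the KRR solution to a small set of landmark points and solve a system of that size only. The sketch below picks landmarks by hand rather than by leverage scores, so it does not implement the paper's method. The RBF kernel, the bandwidth, the regularization value and all function names are illustrative assumptions.

```python
import math


def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars, a stationary kernel."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def nystrom_krr(xs, ys, landmarks, lam):
    """Fit f(x) = sum_j beta_j k(x, z_j) over landmarks z_j and return the predictor.

    Minimizes (1/n)||y - K_nm beta||^2 + lam * beta^T K_mm beta.
    """
    n, m = len(xs), len(landmarks)
    K_nm = [[rbf(x, z) for z in landmarks] for x in xs]
    K_mm = [[rbf(z1, z2) for z2 in landmarks] for z1 in landmarks]
    # Normal equations: (K_mn K_nm + n * lam * K_mm) beta = K_mn y, an m-by-m system.
    A = [[sum(K_nm[i][a] * K_nm[i][b] for i in range(n)) + n * lam * K_mm[a][b]
          for b in range(m)] for a in range(m)]
    rhs = [sum(K_nm[i][a] * ys[i] for i in range(n)) for a in range(m)]
    beta = solve(A, rhs)
    return lambda x: sum(bj * rbf(x, z) for bj, z in zip(beta, landmarks))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8]
f_sub = nystrom_krr(xs, ys, landmarks=[0.0, 2.0, 4.0], lam=0.1)  # 3 of 5 points
f_full = nystrom_krr(xs, ys, landmarks=xs, lam=0.1)  # all points: exact KRR
```

When every training point is a landmark, the fit coincides with exact KRR, which gives a simple sanity check. How well a smaller landmark set does depends on how the landmarks are chosen, which is the problem the leverage-score approximation addresses.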

preprint · 2021 · arXiv

Multi-Knowledge Fusion for New Feature Generation in Generalized Zero-Shot Learning

Suffering from the semantic insufficiency and domain-shift problems, most existing state-of-the-art methods fail to achieve satisfactory results for Zero-Shot Learning (ZSL). In order to alleviate these problems, we propose a novel generative ZSL method to learn more generalized features from multi-knowledge with continuously generated new semantics in semantic-to-visual embedding. In our approach, the proposed Multi-Knowledge Fusion Network (MKFNet) takes different semantic features from multi-knowledge as input, which enables more relevant semantic features to be trained for semantic-to-visual embedding, and finally generates more generalized visual features by adaptively fusing visual features from different knowledge domains. The proposed New Feature Generator (NFG) with an adaptive genetic strategy is used to enrich semantic information on the one hand, and on the other hand it greatly improves the intersection of visual features generated by MKFNet and unseen visual features. Empirically, we show that our approach can achieve significantly better performance compared to existing state-of-the-art methods on a large number of benchmarks for several ZSL tasks, including traditional ZSL, generalized ZSL and zero-shot retrieval.

preprint · 2020 · arXiv

Adversarial Camouflage: Hiding Physical-World Attacks with Natural Styles

Deep neural networks (DNNs) are known to be vulnerable to adversarial examples. Existing works have mostly focused on either digital adversarial examples created via small and imperceptible perturbations, or physical-world adversarial examples created with large and less realistic distortions that are easily identified by human observers. In this paper, we propose a novel approach, called Adversarial Camouflage (AdvCam), to craft and camouflage physical-world adversarial examples into natural styles that appear legitimate to human observers. Specifically, AdvCam transfers large adversarial perturbations into customized styles, which are then "hidden" on the target object or in the off-target background. Experimental evaluation shows that, in both digital and physical-world scenarios, adversarial examples crafted by AdvCam are well camouflaged and highly stealthy, while remaining effective in fooling state-of-the-art DNN image classifiers. Hence, AdvCam is a flexible approach that can help craft stealthy attacks to evaluate the robustness of DNNs. AdvCam can also be used to protect private information from being detected by deep learning systems.

preprint · 2020 · arXiv

Diffusion $K$-means clustering on manifolds: provable exact recovery via semidefinite relaxations

We introduce the diffusion $K$-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion $K$-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of manifolds. The diffusion $K$-means is a multi-scale clustering tool that is suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given the number of clusters, we propose a polynomial-time convex relaxation algorithm via semidefinite programming (SDP) to solve the diffusion $K$-means. In addition, we propose a nuclear norm regularized SDP that is adaptive to the number of clusters. In both cases, we show that exact recovery of the SDPs for diffusion $K$-means can be achieved under suitable between-cluster separability and within-cluster connectedness of the submanifolds, which together quantify the hardness of the manifold clustering problem. We further propose the localized diffusion $K$-means by using the local adaptive bandwidth estimated from the nearest neighbors. We show that exact recovery of the localized diffusion $K$-means is fully adaptive to the local probability density and geometric structures of the underlying submanifolds.

preprint · 2020 · arXiv

Entropy rigidity for 3D conservative Anosov flows and dispersing billiards

Given an integer $k \geq 5$, and a $C^k$ Anosov flow $Φ$ on some compact connected $3$-manifold preserving a smooth volume, we show that the measure of maximal entropy (MME) is the volume measure if and only if $Φ$ is $C^{k-\varepsilon}$-conjugate to an algebraic flow, for $\varepsilon>0$ arbitrarily small. Besides the rigidity, we also study the entropy flexibility, and show that the metric entropy with respect to the volume measure and the topological entropy of suspension flows over Anosov diffeomorphisms on the $2$-torus achieve all possible values subject to natural normalizations. Moreover, in the case of dispersing billiards, we show that if the measure of maximal entropy is the volume measure, then the Birkhoff Normal Form of regular periodic orbits with a homoclinic intersection is linear.

preprint · 2020 · arXiv

Hanson-Wright inequality in Hilbert spaces with application to $K$-means clustering for non-Euclidean data

We derive a dimension-free Hanson-Wright inequality for quadratic forms of independent sub-gaussian random variables in a separable Hilbert space. Our inequality is an infinite-dimensional generalization of the classical Hanson-Wright inequality for finite-dimensional Euclidean random vectors. We illustrate an application to the generalized $K$-means clustering problem for non-Euclidean data. Specifically, we establish the exponential rate of convergence for a semidefinite relaxation of the generalized $K$-means, which together with a simple rounding algorithm imply the exact recovery of the true clustering structure.

preprint · 2020 · arXiv

Hyperspectral Images Classification Based on Multi-scale Residual Network

Hyperspectral remote sensing images contain a lot of redundant information and have a highly non-linear data structure, which leads to low classification accuracy for traditional machine learning methods. The latest research shows that hyperspectral image classification based on deep convolutional neural networks has high accuracy. However, when a small amount of data is used for training, the classification accuracy of deep learning methods is greatly reduced. In order to solve the problem of low classification accuracy of existing algorithms on small samples of hyperspectral images, a multi-scale residual network is proposed. The multi-scale extraction and fusion of spatial and spectral features is realized by adding a branch structure into the residual block and using convolution kernels of different sizes in the branch. The spatial and spectral information contained in hyperspectral images is fully utilized to improve the classification accuracy. In addition, in order to improve the speed and prevent overfitting, the model uses dynamic learning rate, BN and Dropout strategies. The experimental results show that the overall classification accuracy of this method is 99.07% and 99.96% on the Indian Pines and Pavia University data sets, respectively, which is better than other algorithms.

preprint · 2020 · arXiv

MFL_COVID19: Quantifying Country-based Factors affecting Case Fatality Rate in Early Phase of COVID-19 Epidemic via Regularised Multi-task Feature Learning

The recent outbreak of COVID-19 has spread rapidly around the world. Many countries have implemented timely, intensive suppression measures to minimize infections, but still experienced a high case fatality rate (CFR) due to critical demand for health resources. Other country-based factors, such as sociocultural issues and an ageing population, have also influenced the practical effectiveness of interventions to improve mortality in the early phase. Better understanding the relationship between these factors and COVID-19 CFR across different countries is of primary importance in preparing for a potential second wave of COVID-19 infections. In this paper, we propose a novel regularized multi-task learning based factor analysis approach for quantifying country-based factors affecting CFR in the early phase of the COVID-19 epidemic. We formulate the prediction of CFR progression as an ML regression problem with observed CFR and other country-based factors. In this formulation, all CFR-related factors are categorized into 6 sectors with 27 indicators. We propose a hybrid feature selection method combining filter, wrapper and tree-based models to calibrate initial factors for a preliminary feature interaction. We then adopt two typical single-task models (Ridge and Lasso regression) and one state-of-the-art MTFL method (fused sparse group lasso) in our formulation. The fused sparse group Lasso (FSGL) method allows the simultaneous selection of a common set of country-based factors for multiple time points of the COVID-19 epidemic and also enables incorporating temporal smoothness of each factor over the whole early phase period. Finally, we propose a novel temporal voting feature selection scheme to balance the weight instability of multiple factors in our MTFL model.

preprint2016arXiv

A Cost-Effective Strategy for Storing Scientific Datasets with Multiple Service Providers in the Cloud

Cloud computing provides scientists with a platform that can deploy computation- and data-intensive applications without infrastructure investment. With excessive cloud resources and a decision support system, large generated data sets can be flexibly (1) stored locally in the current cloud, (2) deleted and regenerated whenever reused, or (3) transferred to a cheaper cloud service for storage. However, due to the pay-for-use model, the total application cost largely depends on the usage of computation, storage and bandwidth resources, hence cutting the cost of cloud-based data storage becomes a big concern for deploying scientific applications in the cloud. In this paper, we propose a novel strategy that can cost-effectively store large generated data sets with multiple cloud service providers. The strategy is based on a novel algorithm that finds the trade-off among computation, storage and bandwidth costs in the cloud, which are three key factors for the cost of data storage. Both general (random) simulations conducted with popular cloud service providers' pricing models and three specific case studies on real-world scientific applications show that the proposed storage strategy is highly cost-effective and practical for runtime utilization in the cloud.

preprint2016arXiv

Bayesian fractional posteriors

We consider the fractional posterior distribution that is obtained by updating a prior distribution via Bayes theorem with a fractional likelihood function, a usual likelihood function raised to a fractional power. First, we analyze the contraction property of the fractional posterior in a general misspecified framework. Our contraction results only require a prior mass condition on certain Kullback-Leibler (KL) neighborhood of the true parameter (or the KL divergence minimizer in the misspecified case), and obviate constructions of test functions and sieves commonly used in the literature for analyzing the contraction property of a regular posterior. We show through a counterexample that some condition controlling the complexity of the parameter space is necessary for the regular posterior to contract, rendering additional flexibility on the choice of the prior for the fractional posterior. Second, we derive a novel Bayesian oracle inequality based on a PAC-Bayes inequality in misspecified models. Our derivation reveals several advantages of averaging based Bayesian procedures over optimization based frequentist procedures. As an application of the Bayesian oracle inequality, we derive a sharp oracle inequality in the convex regression problem under an arbitrary dimension. We also illustrate the theory in Gaussian process regression and density estimation problems.
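To make the definition above concrete, here is a minimal sketch of a fractional posterior in the one setting where it has an obvious closed form: a Bernoulli likelihood with a Beta prior. This toy model is an assumption chosen for illustration and is not taken from the paper; the function names and the default Beta(1, 1) prior are hypothetical. Raising the Bernoulli likelihood to a power alpha in (0, 1] simply scales the observed counts by alpha, so the fractional posterior is again a Beta distribution.

```python
def fractional_beta_posterior(successes, failures, alpha, a=1.0, b=1.0):
    """Fractional posterior for a Bernoulli model with a Beta(a, b) prior.

    The likelihood theta^s * (1 - theta)^f raised to the power alpha, times
    the Beta(a, b) prior density, is proportional to a
    Beta(a + alpha * s, b + alpha * f) density.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return a + alpha * successes, b + alpha * failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


# alpha = 1 recovers the regular posterior; alpha < 1 tempers the likelihood.
regular = fractional_beta_posterior(7, 3, alpha=1.0)
tempered = fractional_beta_posterior(7, 3, alpha=0.5)
```

In this toy case the tempered posterior has a larger variance than the regular one, which reflects the down-weighting of the data by the fractional power.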

preprint2016arXiv

Communication-Efficient Distributed Statistical Inference

We present a Communication-efficient Surrogate Likelihood (CSL) framework for solving distributed statistical inference problems. CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation and Bayesian inference. For low-dimensional estimation, CSL provably improves upon naive averaging schemes and facilitates the construction of confidence intervals. For high-dimensional regularized estimation, CSL leads to a minimax-optimal estimator with controlled communication cost. For Bayesian inference, CSL can be used to form a communication-efficient quasi-posterior distribution that converges to the true posterior. This quasi-posterior procedure significantly improves the computational efficiency of MCMC algorithms even in a non-distributed setting. We present both theoretical analysis and experiments to explore the properties of the CSL approximation.

preprint2016arXiv

Moving frame and integrable system of the discrete centroaffine curves in R^3

Any two equivalent discrete curves must have the same invariants at corresponding points under an affine transformation. In this paper, we construct the moving frame and invariants for discrete centroaffine curves, which can be used to identify the same discrete curve among different graphics and to determine whether a polygon flow is stable or periodically stable. In fact, using a method similar to the Frenet-Serret frame, a discrete curve can be uniquely identified by its centroaffine curvatures and torsions. In 1878, Darboux studied the problem of midpoint iteration of polygons [12]. Berlekamp et al. studied this problem in detail [2]. Now, through the centroaffine curvatures and torsions, the iteration process can be clearly quantified. Specifically, we describe the whole iteration process using centroaffine curvatures and torsions, so that its periodicity can be directly exhibited. As an application, we obtain some stable discrete space curves whose curvatures and torsions are unchanged after multistep iteration. For the pentagram map of a polygon, the affinely regular polygons are stable. Furthermore, we find that convex hexagons with parallel and equal-length opposite sides are periodically stable, and that some convex octagons with parallel and equal-length opposite sides are also periodically stable. The proofs of these results are obtained using the structure equations of the discrete centroaffine curves and the integrability conditions of their flows.
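The midpoint iteration of polygons mentioned above can be sketched in a few lines. This is only an illustration of the iteration itself, assuming a closed polygon given as a list of coordinate tuples; it does not compute the centroaffine curvatures or torsions used in the paper, and the function names are hypothetical.

```python
def midpoint_iteration(polygon):
    """Replace each vertex by the midpoint of it and its cyclic successor."""
    n = len(polygon)
    return [
        tuple((a + b) / 2.0 for a, b in zip(polygon[i], polygon[(i + 1) % n]))
        for i in range(n)
    ]


def iterate(polygon, steps):
    """Apply the midpoint iteration a given number of times."""
    for _ in range(steps):
        polygon = midpoint_iteration(polygon)
    return polygon


# Works for vertices in any dimension, for example a square in the plane.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
once = midpoint_iteration(square)
```

Each step keeps the centroid of the vertices fixed, so any question of stability concerns the shape of the polygon rather than its position.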

preprint2016arXiv

The discrete centroaffine indefinite surface

In this paper we build the structure equations and the integrable systems for a discrete centroaffine indefinite surface in $\R^3$. At the same time, some centroaffine invariants are obtained according to the structure equations. Using these centroaffine invariants, we study the Laplacian operator and the convexity of a discrete centroaffine indefinite surface. Furthermore, some interesting examples are provided.

preprint2016arXiv

The moving frame on the fractal curves

Using the moving frame and invariants, any discrete curve in $\R^3$ can be uniquely identified by its centroaffine curvatures and torsions. In this paper, based on the affine curvatures of fractal curves such as the Koch curve and the Hilbert curve, we can clearly describe their iterative regularities. Interestingly, by means of the affine curvatures, the fractal curves can be quantified and encoded as a sequence, which is more convenient for future reference. Given three starting points, we can directly generate the affine Koch curve and affine Hilbert curve at step $n$, $\forall n\in \mathbb{Z}^+$. If the initial three points are standard, the curve is the traditional Koch curve or Hilbert curve. By this method, the characteristics of some fractal curves that look irregular can be quantified, and their regularities become more obvious.

preprint2015arXiv

Hyperbolic periodic points for chain hyperbolic homoclinic classes

In this paper we establish a closing property and a hyperbolic closing property for thin trapped chain hyperbolic homoclinic classes with one-dimensional center in the partially hyperbolic setting. Taking advantage of these properties, we prove that the growth rate of the number of hyperbolic periodic points is equal to the topological entropy. We also obtain that the hyperbolic periodic measures are dense in the space of invariant measures.

preprint2015arXiv

Joint estimation of quantile planes over arbitrary predictor spaces

In spite of the recent surge of interest in quantile regression, joint estimation of linear quantile planes remains a great challenge in statistics and econometrics. We propose a novel parametrization that characterizes any collection of non-crossing quantile planes over arbitrarily shaped convex predictor domains in any dimension by means of unconstrained scalar, vector and function valued parameters. Statistical models based on this parametrization inherit a fast computation of the likelihood function, enabling penalized likelihood or Bayesian approaches to model fitting. We introduce a complete Bayesian methodology by using Gaussian process prior distributions on the function valued parameters and develop a robust and efficient Markov chain Monte Carlo parameter estimation. The resulting method is shown to offer posterior consistency under mild tail and regularity conditions. We present several illustrative examples where the new method is compared against existing approaches and is found to offer better accuracy, coverage and model fit.

preprint2015arXiv

Minimax-optimal nonparametric regression in high dimensions

Minimax $L_2$ risks for high-dimensional nonparametric regression are derived under two sparsity assumptions: (1) the true regression surface is a sparse function that depends only on $d=O(\log n)$ important predictors among a list of $p$ predictors, with $\log p=o(n)$; (2) the true regression surface depends on $O(n)$ predictors but is an additive function where each additive component is sparse but may contain two or more interacting predictors and may have a smoothness level different from other components. For either modeling assumption, a practicable extension of the widely used Bayesian Gaussian process regression method is shown to adaptively attain the optimal minimax rate (up to $\log n$ terms) asymptotically as both $n,p\to\infty$ with $\log p=o(n)$.

preprint2015arXiv

On the Computational Complexity of High-Dimensional Bayesian Variable Selection

We study the computational complexity of Markov chain Monte Carlo (MCMC) methods for high-dimensional Bayesian linear regression under sparsity constraints. We first show that a Bayesian approach can achieve variable-selection consistency under relatively mild conditions on the design matrix. We then demonstrate that the statistical criterion of posterior concentration need not imply the computational desideratum of rapid mixing of the MCMC algorithm. By introducing a truncated sparsity prior for variable selection, we provide a set of conditions that guarantee both variable-selection consistency and rapid mixing of a particular Metropolis-Hastings algorithm. The mixing time is linear in the number of covariates up to a logarithmic factor. Our proof controls the spectral gap of the Markov chain by constructing a canonical path ensemble that is inspired by the steps taken by greedy algorithms for variable selection.

preprint2015arXiv

On the last fall degree of zero-dimensional Weil descent systems

In this article we will discuss a new, mostly theoretical, method for solving (zero-dimensional) polynomial systems, which lies in between Gröbner basis computations and the heuristic first fall degree assumption and is not based on any heuristic. This method relies on the new concept of last fall degree. Let $k$ be a finite field of cardinality $q^n$ and let $k'$ be its subfield of cardinality $q$. Let $\mathcal{F} \subset k[X_0,\ldots,X_{m-1}]$ be a finite subset generating a zero-dimensional ideal. We give an upper bound of the last fall degree of the Weil descent system of $\mathcal{F}$, which depends on $q$, $m$, the last fall degree of $\mathcal{F}$, the degree of $\mathcal{F}$ and the number of solutions of $\mathcal{F}$, but not on $n$. This shows that such Weil descent systems can be solved efficiently if $n$ grows. In particular, we apply these results for multi-HFE and essentially show that multi-HFE is insecure. Finally, we discuss that the degree of regularity (or last fall degree) of Weil descent systems coming from summation polynomials to solve the elliptic curve discrete logarithm problem might depend on $n$, since such systems without field equations are not zero-dimensional.

preprint2015arXiv

Randomized sketches for kernels: Fast and optimal non-parametric regression

Kernel ridge regression (KRR) is a standard method for performing non-parametric regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space complexity of computing the KRR estimate scale as $\mathcal{O}(n^3)$ and $\mathcal{O}(n^2)$ respectively, and so is prohibitive in many cases. We propose approximations of KRR based on $m$-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension $m$ can be chosen while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension $m$ proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax optimal approximations to the KRR estimate for non-parametric regression.
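The idea of sketching the kernel matrix can be illustrated with a small NumPy example. This is a minimal sketch under stated assumptions rather than the paper's exact estimator: the objective scaling, the RBF kernel, the toy data and all function names are illustrative choices. The weight vector is restricted to the form $S^\top a$ for an $m \times n$ sketch matrix $S$, so the linear system to solve is $m \times m$ instead of $n \times n$.

```python
import numpy as np


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel matrix between 1-D sample arrays x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-gamma * diff ** 2)


def exact_krr(K, y, lam):
    """Exact KRR weights w minimizing (1/n)||y - K w||^2 + lam * w^T K w."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)


def sketched_krr(K, y, lam, S):
    """Same objective, with w restricted to the form S^T a for S of shape (m, n)."""
    n = K.shape[0]
    SK = S @ K  # shape (m, n)
    # Normal equations in a: (S K^2 S^T + n * lam * S K S^T) a = S K y
    lhs = SK @ SK.T + n * lam * (SK @ S.T)
    a = np.linalg.solve(lhs, SK @ y)
    return S.T @ a  # weights in the original n-dimensional space


def gaussian_sketch(m, n, rng):
    """Gaussian sketch: i.i.d. N(0, 1/m) entries."""
    return rng.standard_normal((m, n)) / np.sqrt(m)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)
K = rbf_kernel(x, x, gamma=50.0)

w_sketch = sketched_krr(K, y, lam=0.1, S=gaussian_sketch(4, 8, rng))
fitted = K @ w_sketch  # fitted values at the training points
```

With $S$ equal to the identity the restriction is vacuous and the sketched solution coincides with exact KRR, which gives a quick sanity check on the normal equations.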

preprint2015arXiv

Semiparametric Bernstein-von Mises Theorem: Second Order Studies

The major goal of this paper is to study the second order frequentist properties of the marginal posterior distribution of the parametric component in semiparametric Bayesian models, in particular, a second order semiparametric Bernstein-von Mises (BvM) Theorem. Our first contribution is to discover an interesting interference phenomenon between Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation on the nuisance function leads to higher frequentist inferential accuracy on the parametric component. As the second contribution, we propose a new class of dependent priors under which Bayesian inference procedures for the parametric component are not only efficient but also adaptive (w.r.t. the smoothness of nonparametric component) up to the second order frequentist validity. However, commonly used independent priors may even fail to produce a desirable root-n contraction rate for the parametric component in this adaptive case unless some stringent assumption is imposed. Three important classes of semiparametric models are examined, and extensive simulations are also provided.

preprint2015arXiv

The $C^1$ density of nonuniform hyperbolicity in $C^{ r}$ conservative diffeomorphisms

Let $\Diff^{ r}_m(M)$ be the set of $C^{ r}$ volume-preserving diffeomorphisms on a compact Riemannian manifold $M$ ($\dim M\geq 2$). In this paper, we prove that the diffeomorphisms without zero Lyapunov exponents on a set of positive volume are $C^1$ dense in $\Diff^{ r}_m(M), r\geq 1$. We also prove a weaker result for symplectic diffeomorphisms $\mathcal{S}ym^{r}_ω(M), r\geq1 $ saying that the symplectic diffeomorphisms with non-zero Lyapunov exponents on a set of positive volume are $C^1$ dense in $\mathcal{S}ym^{r}_ω(M), r\geq1 $.

preprint2014arXiv

Bayesian Manifold Regression

There is increasing interest in the problem of nonparametric regression with high-dimensional predictors. When the number of predictors $D$ is large, one encounters a daunting problem in attempting to estimate a $D$-dimensional surface based on limited data. Fortunately, in many applications, the support of the data is concentrated on a $d$-dimensional subspace with $d \ll D$. Manifold learning attempts to estimate this subspace. Our focus is on developing computationally tractable and theoretically supported Bayesian nonparametric regression methods in this context. When the subspace corresponds to a locally-Euclidean compact Riemannian manifold, we show that a Gaussian process regression approach can be applied that leads to the minimax optimal adaptive rate in estimating the regression function under some conditions. The proposed model bypasses the need to estimate the manifold, and can be implemented using standard algorithms for posterior computation in Gaussian processes. Finite sample performance is illustrated in an example data analysis.

preprint2014arXiv

Horseshoes for $\mathcal{C}^{1+α}$ mappings with hyperbolic measures

We present here a construction of horseshoes for any $\mathcal{C}^{1+α}$ mapping $f$ preserving an ergodic hyperbolic measure $μ$ with $h_μ(f)>0$ and then deduce that the exponential growth rate of the number of periodic points for any $\mathcal{C}^{1+α}$ mapping $f$ is greater than or equal to $h_μ(f)$. We also prove that the exponential growth rate of the number of hyperbolic periodic points is equal to the hyperbolic entropy. The hyperbolic entropy means the entropy resulting from hyperbolic measures.

preprint2014arXiv

Livšic Measurable Rigidity Theorem for $\mathcal{C}^1$ Generic Volume-preserving Systems

In this paper, we prove that for $\mathcal{C}^1$ generic volume-preserving Anosov diffeomorphisms of a compact Riemannian manifold, the Livšic measurable rigidity theorem holds. We also prove that for $\mathcal{C}^1$ generic volume-preserving Anosov flows of a compact Riemannian manifold, the Livšic measurable rigidity theorem holds.

preprint2014arXiv

Minimax Optimal Bayesian Aggregation

It is generally believed that ensemble approaches, which combine multiple algorithms or models, can outperform any single algorithm at machine learning tasks, such as prediction. In this paper, we propose Bayesian convex and linear aggregation approaches motivated by regression applications. We show that the proposed approach is minimax optimal when the true data-generating model is a convex or linear combination of models in the list. Moreover, the method can adapt to sparsity structure in which certain models should receive zero weights, and the method is tuning parameter free unlike competitors. More generally, under an M-open view when the truth falls outside the space of all convex/linear combinations, our theory suggests that the posterior measure tends to concentrate on the best approximation of the truth at the minimax rate. We illustrate the method through simulation studies and several applications.

preprint2013arXiv

Bayesian Conditional Tensor Factorizations for High-Dimensional Classification

In many application areas, data are collected on a categorical response and high-dimensional categorical predictors, with the goals being to build a parsimonious model for classification while doing inferences on the important predictors. In settings such as genomics, there can be complex interactions among the predictors. By using a carefully-structured Tucker factorization, we define a model that can characterize any conditional probability, while facilitating variable selection and modeling of higher-order interactions. Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm for posterior computation accommodating uncertainty in the predictors to be included. Under near sparsity assumptions, the posterior distribution for the conditional probability is shown to achieve close to the parametric rate of contraction even in ultra high-dimensional settings. The methods are illustrated using simulation examples and biomedical applications.

preprint2013arXiv

Bayesian crack detection in ultra high resolution multimodal images of paintings

The preservation of our cultural heritage is of paramount importance. Thanks to recent developments in digital acquisition techniques, powerful image analysis algorithms are developed which can be useful non-invasive tools to assist in the restoration and preservation of art. In this paper we propose a semi-supervised crack detection method that can be used for high-dimensional acquisitions of paintings coming from different modalities. Our dataset consists of a recently acquired collection of images of the Ghent Altarpiece (1432), one of Northern Europe's most important art masterpieces. Our goal is to build a classifier that is able to discern crack pixels from the background consisting of non-crack pixels, making optimal use of the information that is provided by each modality. To accomplish this we employ a recently developed non-parametric Bayesian classifier, that uses tensor factorizations to characterize any conditional probability. A prior is placed on the parameters of the factorization such that every possible interaction between predictors is allowed while still identifying a sparse subset among these predictors. The proposed Bayesian classifier, which we will refer to as conditional Bayesian tensor factorization or CBTF, is assessed by visually comparing classification results with the Random Forest (RF) algorithm.

preprint2013arXiv

Sequential Markov Chain Monte Carlo

We propose a sequential Markov chain Monte Carlo (SMCMC) algorithm to sample from a sequence of probability distributions, corresponding to posterior distributions at different times in on-line applications. SMCMC proceeds as in usual MCMC but with the stationary distribution updated appropriately each time new data arrive. SMCMC has advantages over sequential Monte Carlo (SMC) in avoiding particle degeneracy issues. We provide theoretical guarantees for the marginal convergence of SMCMC under various settings, including parametric and nonparametric models. The proposed approach is compared to competitors in a simulation study. We also consider an application to on-line nonparametric regression.

45 published item(s)

Accumulation of Sub-Sampling Matrices with Applications to Statistical Computation

Toward Scalable Terminal Task Synthesis via Skill Graphs

A Practical Beamforming Design for Active RIS-assisted MU-MISO Systems

Estimating Distributions with Low-dimensional Structures Using Mixtures of Generative Models

Cost-effective Land Cover Classification for Remote Sensing Images

High-Dimensional Linear Regression via Implicit Regularization

Hypernetwork Dismantling via Deep Reinforcement Learning

Learning Topic Models: Identifiability and Finite-Sample Analysis

Mean-Field Nonparametric Estimation of Interacting Particle Systems

Minimax Rate of Distribution Estimation on Unknown Submanifold under Adversarial Losses

N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks

Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering

Class Knowledge Overlay to Visual Feature Learning for Zero-Shot Image Classification

Cross Knowledge-based Generative Zero-Shot Learning Approach with Taxonomy Regularization

Distributed Estimation for Principal Component Analysis: an Enlarged Eigenspace Analysis

EdgeWorkflowReal: An Edge Computing based Workflow Execution Engine for Smart Systems

Fast Statistical Leverage Score Approximation in Kernel Ridge Regression

Multi-Knowledge Fusion for New Feature Generation in Generalized Zero-Shot Learning

Adversarial Camouflage: Hiding Physical-World Attacks with Natural Styles

Diffusion $K$-means clustering on manifolds: provable exact recovery via semidefinite relaxations

Entropy rigidity for 3D conservative Anosov flows and dispersing billiards

Hanson-Wright inequality in Hilbert spaces with application to $K$-means clustering for non-Euclidean data

Hyperspectral Images Classification Based on Multi-scale Residual Network

MFL_COVID19: Quantifying Country-based Factors affecting Case Fatality Rate in Early Phase of COVID-19 Epidemic via Regularised Multi-task Feature Learning

A Cost-Effective Strategy for Storing Scientific Datasets with Multiple Service Providers in the Cloud

Bayesian fractional posteriors

Communication-Efficient Distributed Statistical Inference

Moving frame and integrable system of the discrete centroaffine curves in R^3

The discrete centroaffine indefinite surface

The moving frame on the fractal curves

Hyperbolic periodic points for chain hyperbolic homoclinic classes

Joint estimation of quantile planes over arbitrary predictor spaces

Minimax-optimal nonparametric regression in high dimensions

On the Computational Complexity of High-Dimensional Bayesian Variable Selection

On the last fall degree of zero-dimensional Weil descent systems

Randomized sketches for kernels: Fast and optimal non-parametric regression

Semiparametric Bernstein-von Mises Theorem: Second Order Studies

The $C^1$ density of nonuniform hyperbolicity in $C^{ r}$ conservative diffeomorphisms

Bayesian Manifold Regression

Horseshoes for $\mathcal{C}^{1+α}$ mappings with hyperbolic measures

Livšic Measurable Rigidity Theorem for $\mathcal{C}^1$ Generic Volume-preserving Systems

Minimax Optimal Bayesian Aggregation

Bayesian Conditional Tensor Factorizations for High-Dimensional Classification

Bayesian crack detection in ultra high resolution multimodal images of paintings

Sequential Markov Chain Monte Carlo