Source author record

Xiuyuan Cheng

Xiuyuan Cheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.ST Statistics Theory Computer Vision math.NA Numerical Analysis eess.IV Information Theory math-ph math.AP math.FA math.IT math.MP math.OC math.PR math.SP

Catalog footprint

What is connected

18works

16topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Point processes with event time uncertainty

Point processes are widely used statistical models for continuous-time discrete event data, such as medical records, crime reports, and social network interactions, to capture the influence of historical events on future occurrences. In many applications, however, event times are not observed exactly, motivating the need to incorporate time uncertainty into point process modeling. In this work, we introduce a framework for modeling time-uncertain self-exciting point processes, known as Hawkes processes, possibly defined over a network. We begin by formulating the model in continuous time under assumptions motivated by real-world scenarios. By imposing a time grid, we obtain a discrete-time model that facilitates inference and enables computation via first-order optimization methods such as gradient descent and variational inequality (VI). We establish a parameter recovery guarantee for VI inference with an $O(1/k)$ convergence rate using $k$ steps. Our framework accommodates non-stationary processes by representing the influence kernel as a matrix (or tensor on a network), while also encompassing stationary processes, such as the classical Hawkes process, as a special case. Empirically, we demonstrate that the proposed approach outperforms existing baselines on both simulated and real-world datasets, including the sepsis-associated derangement prediction challenge and the Atlanta Police Crime Dataset.

preprint2022arXiv

Neural Spectral Marked Point Processes

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete events data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.

preprint2022arXiv

Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters

Encoding the scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many computer vision tasks especially when dealing with multiscale inputs. We study, in this paper, a scaling-translation-equivariant (ST-equivariant) CNN with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve equivariance for the regular representation of the scaling-translation group ST . To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation, a property which is theoretically analyzed and empirically verified. Numerical experiments demonstrate that the proposed scaling-translation-equivariant network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.

preprint2021arXiv

Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples

The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d + 2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space. A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.

preprint2021arXiv

Convergence of Graph Laplacian with kNN Self-tuned Kernels

Kernelized Gram matrix $W$ constructed from data points $\{x_i\}_{i=1}^N$ as $W_{ij}= k_0( \frac{ \| x_i - x_j \|^2} {σ^2} )$ is widely used in graph-based geometric data analysis and unsupervised learning. An important question is how to choose the kernel bandwidth $σ$, and a common practice called self-tuned kernel adaptively sets a $σ_i$ at each point $x_i$ by the $k$-nearest neighbor (kNN) distance. When $x_i$'s are sampled from a $d$-dimensional manifold embedded in a possibly high-dimensional space, unlike with fixed-bandwidth kernels, theoretical results of graph Laplacian convergence with self-tuned kernels have been incomplete. This paper proves the convergence of graph Laplacian operator $L_N$ to manifold (weighted-)Laplacian for a new family of kNN self-tuned kernels $W^{(α)}_{ij} = k_0( \frac{ \| x_i - x_j \|^2}{ ε\hatρ(x_i) \hatρ(x_j)})/\hatρ(x_i)^α\hatρ(x_j)^α$, where $\hatρ$ is the estimated bandwidth function {by kNN}, and the limiting operator is also parametrized by $α$. When $α= 1$, the limiting operator is the weighted manifold Laplacian $Δ_p$. Specifically, we prove the point-wise convergence of $L_N f $ and convergence of the graph Dirichlet form with rates. Our analysis is based on first establishing a $C^0$ consistency for $\hatρ$ which bounds the relative estimation error $|\hatρ - \barρ|/\barρ$ uniformly with high probability, where $\barρ = p^{-1/d}$, and $p$ is the data density function. Our theoretical results reveal the advantage of self-tuned kernel over fixed-bandwidth kernel via smaller variance error in low-density regions. In the algorithm, no prior knowledge of $d$ or data density is needed. The theoretical results are supported by numerical experiments on simulated data and hand-written digit image data.

preprint2020arXiv

ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution

Convolutional Neural Networks (CNNs) are known to be significantly over-parametrized, and difficult to interpret, train and adapt. In this paper, we introduce a structural regularization across convolutional kernels in a CNN. In our approach, each convolution kernel is first decomposed as 2D dictionary atoms linearly combined by coefficients. The widely observed correlation and redundancy in a CNN hint a common low-rank structure among the decomposed coefficients, which is here further supported by our empirical observations. We then explicitly regularize CNN kernels by enforcing decomposed coefficients to be shared across sub-structures, while leaving each sub-structure only its own dictionary atoms, a few hundreds of parameters typically, which leads to dramatic model reductions. We explore models with sharing across different sub-structures to cover a wide range of trade-offs between parameter reduction and expressiveness. Our proposed regularized network structures open the door to better interpreting, training and adapting deep models. We validate the flexibility and compatibility of our method by image classification experiments on multiple datasets and underlying network structures, and show that CNNs now maintain performance with dramatic reduction in parameters and computations, e.g., only 5\% parameters are used in a ResNet-18 to achieve comparable performance. Further experiments on few-shot classification show that faster and more robust task adaptation is obtained in comparison with models with standard convolutions.

preprint2020arXiv

Butterfly-Net: Optimal Function Representation Based on Convolutional Neural Networks

Deep networks, especially convolutional neural networks (CNNs), have been successfully applied in various areas of machine learning as well as to challenging problems in other scientific and engineering fields. This paper introduces Butterfly-Net, a low-complexity CNN with structured and sparse cross-channel connections, together with a Butterfly initialization strategy for a family of networks. Theoretical analysis of the approximation power of Butterfly-Net to the Fourier representation of input data shows that the error decays exponentially as the depth increases. Combining Butterfly-Net with a fully connected neural network, a large class of problems are proved to be well approximated with network complexity depending on the effective frequency bandwidth instead of the input dimension. Regular CNN is covered as a special case in our analysis. Numerical experiments validate the analytical results on the approximation of Fourier kernels and energy functionals of Poisson's equations. Moreover, all experiments support that training from Butterfly initialization outperforms training from random initialization. Also, adding the remaining cross-channel connections, although significantly increase the parameter number, does not much improve the post-training accuracy and is more sensitive to data distribution.

preprint2020arXiv

Butterfly-Net2: Simplified Butterfly-Net and Fourier Transform Initialization

Structured CNN designed using the prior information of problems potentially improves efficiency over conventional CNNs in various tasks in solving PDEs and inverse problems in signal processing. This paper introduces BNet2, a simplified Butterfly-Net and inline with the conventional CNN. Moreover, a Fourier transform initialization is proposed for both BNet2 and CNN with guaranteed approximation power to represent the Fourier transform operator. Experimentally, BNet2 and the Fourier transform initialization strategy are tested on various tasks, including approximating Fourier transform operator, end-to-end solvers of linear and nonlinear PDEs, and denoising and deblurring of 1D signals. On all tasks, under the same initialization, BNet2 achieves similar accuracy as CNN but has fewer parameters. And Fourier transform initialized BNet2 and CNN consistently improve the training and testing accuracy over the randomly initialized CNN.

preprint2020arXiv

Classification Logit Two-sample Testing by Neural Networks

The recent success of generative adversarial networks and variational learning suggests training a classifier network may work well in addressing the classical two-sample problem. Network-based tests have the computational advantage that the algorithm scales to large samples. This paper proposes a two-sample statistic which is the difference of the logit function, provided by a trained classification neural network, evaluated on the testing set split of the two datasets. Theoretically, we prove the testing power to differentiate two sub-exponential densities given that the network is sufficiently parametrized. When the two densities lie on or near to low-dimensional manifolds embedded in possibly high-dimensional space, the needed network complexity is reduced to only scale with the intrinsic dimensionality. Both the approximation and estimation error analysis are based on a new result of near-manifold integral approximation. In experiments, the proposed method demonstrates better performance than previous network-based tests using classification accuracy as the two-sample statistic, and compares favorably to certain kernel maximum mean discrepancy tests on synthetic datasets and hand-written digit datasets.

preprint2020arXiv

On Matrix Rearrangement Inequalities

Given two symmetric and positive semidefinite square matrices $A, B$, is it true that any matrix given as the product of $m$ copies of $A$ and $n$ copies of $B$ in a particular sequence must be dominated in the spectral norm by the ordered matrix product $A^m B^n$? For example, is $$ \| AABAABABB \| \leq \| AAAAABBBB \|\ ? $$ Drury has characterized precisely which disordered words have the property that an inequality of this type holds for all matrices $A,B$. However, the $1$-parameter family of counterexamples Drury constructs for these characterizations is comprised of $3 \times 3$ matrices, and thus as stated the characterization applies only for $N \times N$ matrices with $N \geq 3$. In contrast, we prove that for $2 \times 2$ matrices, the general rearrangement inequality holds for all disordered words. We also show that for larger $N \times N$ matrices, the general rearrangement inequality holds for all disordered words, for most $A,B$ (in a sense of full measure) that are sufficiently small perturbations of the identity.

preprint2020arXiv

Stochastic Conditional Generative Networks with Basis Decomposition

While generative adversarial networks (GANs) have revolutionized machine learning, a number of open questions remain to fully understand them and exploit their power. One of these questions is how to efficiently achieve proper diversity and sampling of the multi-mode data space. To address this, we introduce BasisGAN, a stochastic conditional multi-mode image generator. By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we learn a plug-and-played basis generator to stochastically generate basis elements, with just a few hundred of parameters, to fully embed stochasticity into convolutional filters. By sampling basis elements instead of filters, we dramatically reduce the cost of modeling the parameter space with no sacrifice on either image diversity or fidelity. To illustrate this proposed plug-and-play framework, we construct variants of BasisGAN based on state-of-the-art conditional image generation networks, and train the networks by simply plugging in a basis generator, without additional auxiliary components, hyperparameters, or training objectives. The experimental success is complemented with theoretical results indicating how the perturbations introduced by the proposed sampling of basis elements can propagate to the appearance of generated images.

preprint2020arXiv

Variational Diffusion Autoencoders with Random Walk Sampling

Variational autoencoders (VAEs) and generative adversarial networks (GANs) enjoy an intuitive connection to manifold learning: in training the decoder/generator is optimized to approximate a homeomorphism between the data distribution and the sampling space. This is a construction that strives to define the data manifold. A major obstacle to VAEs and GANs, however, is choosing a suitable prior that matches the data topology. Well-known consequences of poorly picked priors are posterior and mode collapse. To our knowledge, no existing method sidesteps this user choice. Conversely, $\textit{diffusion maps}$ automatically infer the data topology and enjoy a rigorous connection to manifold learning, but do not scale easily or provide the inverse homeomorphism (i.e. decoder/generator). We propose a method that combines these approaches into a generative model that inherits the asymptotic guarantees of $\textit{diffusion maps}$ while preserving the scalability of deep models. We prove approximation theoretic results for the dimension dependence of our proposed method. Finally, we demonstrate the effectiveness of our method with various real and synthetic datasets.

preprint2016arXiv

Marchenko-Pastur Law for Tyler's M-estimator

This paper studies the limiting behavior of Tyler's M-estimator for the scatter matrix, in the regime that the number of samples $n$ and their dimension $p$ both go to infinity, and $p/n$ converges to a constant $y$ with $0<y<1$. We prove that when the data samples $x_1, \ldots, x_n$ are identically and independently generated from the Gaussian distribution $\mathcal{N}(0, I)$, the operator norm of the difference between a properly scaled Tyler's M-estimator and $\sum_{i=1}^n x_i x_i^\top/n$ tends to zero. As a result, the spectral distribution of Tyler's M-estimator converges weakly to the Marčenko-Pastur distribution.

preprint2016arXiv

On the Diffusion Geometry of Graph Laplacians and Applications

We study directed, weighted graphs $G=(V,E)$ and consider the (not necessarily symmetric) averaging operator $$ (\mathcal{L}u)(i) = -\sum_{j \sim_{} i}{p_{ij} (u(j) - u(i))},$$ where $p_{ij}$ are normalized edge weights. Given a vertex $i \in V$, we define the diffusion distance to a set $B \subset V$ as the smallest number of steps $d_{B}(i) \in \mathbb{N}$ required for half of all random walks started in $i$ and moving randomly with respect to the weights $p_{ij}$ to visit $B$ within $d_{B}(i)$ steps. Our main result is that the eigenfunctions interact nicely with this notion of distance. In particular, if $u$ satisfies $\mathcal{L}u = λu$ on $V$ and $$ B = \left\{ i \in V: - \varepsilon \leq u(i) \leq \varepsilon \right\} \neq \emptyset,$$ then, for all $i \in V$, $$ d_{B}(i) \log{\left( \frac{1}{|1-λ|} \right) } \geq \log{\left( \frac{ |u(i)| }{\|u\|_{L^{\infty}}} \right)} - \log{\left(\frac{1}{2} + \varepsilon\right)}.$$ $d_B(i)$ is a remarkably good approximation of $|u|$ in the sense of having very high correlation. The result implies that the classical one-dimensional spectral embedding preserves particular aspects of geometry in the presence of clustered data. We also give a continuous variant of the result which has a connection to the hot spots conjecture.

preprint2015arXiv

Deep Haar Scattering Networks

An orthogonal Haar scattering transform is a deep network, computed with a hierarchy of additions, subtractions and absolute values, over pairs of coefficients. It provides a simple mathematical model for unsupervised deep network learning. It implements non-linear contractions, which are optimized for classification, with an unsupervised pair matching algorithm, of polynomial complexity. A structured Haar scattering over graph data computes permutation invariant representations of groups of connected points in the graph. If the graph connectivity is unknown, unsupervised Haar pair learning can provide a consistent estimation of connected dyadic groups of points. Classification results are given on image data bases, defined on regular grids or graphs, with a connectivity which may be known or unknown.

preprint2014arXiv

Concentration of the Kirchhoff index for Erdos-Renyi graphs

Given an undirected graph, the resistance distance between two nodes is the resistance one would measure between these two nodes in an electrical network if edges were resistors. Summing these distances over all pairs of nodes yields the so-called Kirchhoff index of the graph, which measures its overall connectivity. In this work, we consider Erdos-Renyi random graphs. Since the graphs are random, their Kirchhoff indices are random variables. We give formulas for the expected value of the Kirchhoff index and show it concentrates around its expectation. We achieve this by studying the trace of the pseudoinverse of the Laplacian of Erdos-Renyi graphs. For synchronization (a class of estimation problems on graphs) our results imply that acquiring pairwise measurements uniformly at random is a good strategy, even if only a vanishing proportion of the measurements can be acquired.

preprint2014arXiv

Unsupervised Deep Haar Scattering on Graphs

The classification of high-dimensional data defined on graphs is particularly difficult when the graph geometry is unknown. We introduce a Haar scattering transform on graphs, which computes invariant signal descriptors. It is implemented with a deep cascade of additions, subtractions and absolute values, which iteratively compute orthogonal Haar wavelet transforms. Multiscale neighborhoods of unknown graphs are estimated by minimizing an average total variation, with a pair matching algorithm of polynomial complexity. Supervised classification with dimension reduction is tested on data bases of scrambled images, and for signals sampled on unknown irregular grids on a sphere.

preprint2012arXiv

The Spectrum of Random Inner-product Kernel Matrices

We consider n-by-n matrices whose (i, j)-th entry is f(X_i^T X_j), where X_1, ...,X_n are i.i.d. standard Gaussian random vectors in R^p, and f is a real-valued function. The eigenvalue distribution of these random kernel matrices is studied at the "large p, large n" regime. It is shown that, when p and n go to infinity, p/n = γwhich is a constant, and f is properly scaled so that Var(f(X_i^T X_j)) is O(p^{-1}), the spectral density converges weakly to a limiting density on R. The limiting density is dictated by a cubic equation involving its Stieltjes transform. While for smooth kernel functions the limiting spectral density has been previously shown to be the Marcenko-Pastur distribution, our analysis is applicable to non-smooth kernel functions, resulting in a new family of limiting densities.

Xiuyuan Cheng

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

Point processes with event time uncertainty

Neural Spectral Marked Point Processes

Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters

Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples

Convergence of Graph Laplacian with kNN Self-tuned Kernels

ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution

Butterfly-Net: Optimal Function Representation Based on Convolutional Neural Networks

Butterfly-Net2: Simplified Butterfly-Net and Fourier Transform Initialization

Classification Logit Two-sample Testing by Neural Networks

On Matrix Rearrangement Inequalities

Stochastic Conditional Generative Networks with Basis Decomposition

Variational Diffusion Autoencoders with Random Walk Sampling

Marchenko-Pastur Law for Tyler's M-estimator

On the Diffusion Geometry of Graph Laplacians and Applications

Deep Haar Scattering Networks

Concentration of the Kirchhoff index for Erdos-Renyi graphs

Unsupervised Deep Haar Scattering on Graphs

The Spectrum of Random Inner-product Kernel Matrices