Source author record

Richard Yi Da Xu

Richard Yi Da Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computer Vision Information Retrieval Social and Information Networks Artificial Intelligence Computation and Language eess.SP

Catalog footprint

What is connected

18works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

Graph neural networks (GNNs) have emerged as a fundamental tool for learning from graph-structured data, achieving strong performance across a wide range of applications. However, understanding their generalization capabilities remains challenging due to the complex structural dependencies inherent in such data. Existing generalization analyses largely follow the classical machine learning paradigm, focusing primarily on model complexity while overlooking the fundamental role of graph structure. Therefore, in this work, we systematically investigate this role by asking: does the graph structure actually influence generalization, and if so, by how much? To answer the first question and validate our intuition, we theoretically prove that incorporating more edges into the prediction process transforms the input representations to be overly accommodating to the output model, thereby inducing overfitting. To address the second question, we formulate a structural complexity measure based on the number of effective edges and derive a Rademacher complexity-based generalization bound. In doing so, we demonstrate that GNN generalization depends explicitly on structural complexity, alongside traditional parameter-dependent factors. Motivated by these theoretical findings, we propose a structural entropy regularization method. This approach controls structural complexity by regulating effective edges to balance underfitting and overfitting, ultimately improving the generalization performance of GNNs.

preprint2023arXiv

A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning

Construction of dictionaries using nonnegative matrix factorisation (NMF) has extensive applications in signal processing and machine learning. With the advances in deep learning, training compact and robust dictionaries using deep neural networks, i.e., dictionaries of deep features, has been proposed. In this study, we propose a probabilistic generative model which employs a variational autoencoder (VAE) to perform nonnegative dictionary learning. In contrast to the existing VAE models, we cast the model under a statistical framework with latent variables obeying a Gamma distribution and design a new loss function to guarantee the nonnegative dictionaries. We adopt an acceptance-rejection sampling reparameterization trick to update the latent variables iteratively. We apply the dictionaries learned from VAE-NMF to two signal processing tasks, i.e., enhancement of speech and extraction of muscle synergies. Experimental results demonstrate that VAE-NMF performs better in learning the latent nonnegative dictionaries in comparison with state-of-the-art methods.

preprint2022arXiv

Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning

PAC-Bayesian is an analysis framework where the training error can be expressed as the weighted average of the hypotheses in the posterior distribution whilst incorporating the prior knowledge. In addition to being a pure generalization bound analysis tool, PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, making them a powerful and relevant framework that can numerically provide a tight generalization bound for supervised learning. For simplicity, we call probabilistic neural network learned using training objectives derived from PAC-Bayesian bounds as {\it PAC-Bayesian learning}. Despite their empirical success, the theoretical analysis of PAC-Bayesian learning for neural networks is rarely explored. This paper proposes a new class of convergence and generalization analysis for PAC-Bayes learning when it is used to train the over-parameterized neural networks by the gradient descent method. For a wide probabilistic neural network, we show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression when the probabilistic neural tangent kernel (PNTK) is used as its kernel. Based on this finding, we further characterize the uniform PAC-Bayesian generalization bound which improves over the Rademacher complexity-based bound for non-probabilistic neural network. Finally, drawing the insight from our theoretical results, we propose a proxy measure for efficient hyperparameters selection, which is proven to be time-saving.

preprint2022arXiv

Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective

Graph convolutional networks (GCNs) and their variants have achieved great success in dealing with graph-structured data. Nevertheless, it is well known that deep GCNs suffer from the over-smoothing problem, where node representations tend to be indistinguishable as more layers are stacked up. The theoretical research to date on deep GCNs has focused primarily on expressive power rather than trainability, an optimization perspective. Compared to expressivity, trainability attempts to address a more fundamental question: Given a sufficiently expressive space of models, can we successfully find a good solution via gradient descent-based optimizers? This work fills this gap by exploiting the Graph Neural Tangent Kernel (GNTK), which governs the optimization trajectory under gradient descent for wide GCNs. We formulate the asymptotic behaviors of GNTK in the large depth, which enables us to reveal the dropping trainability of wide and deep GCNs at an exponential rate in the optimization process. Additionally, we extend our theoretical framework to analyze residual connection-based techniques, which are found to be merely able to mitigate the exponential decay of trainability mildly. Inspired by our theoretical insights on trainability, we propose Critical DropEdge, a connectivity-aware and graph-adaptive sampling method, to alleviate the exponential decay problem more fundamentally. Experimental evaluation consistently confirms using our proposed method can achieve better results compared to relevant counterparts with both infinite-width and finite-width.

preprint2021arXiv

Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted Attention

Person Re-identification (ReID) is a critical computer vision task which aims to match the same person in images or video sequences. Most current works focus on settings where the resolution of images is kept the same. However, the resolution is a crucial factor in person ReID, especially when the cameras are at different distances from the person or the camera's models are different from each other. In this paper, we propose a novel two-stream network with a lightweight resolution association ReID feature transformation (RAFT) module and a self-weighted attention (SWA) ReID module to evaluate features under different resolutions. RAFT transforms the low resolution features to corresponding high resolution features. SWA evaluates both features to get weight factors for the person ReID. Both modules are jointly trained to get a resolution-invariant representation. Extensive experiments on five benchmark datasets show the effectiveness of our method. For instance, we achieve Rank-1 accuracy of 43.3% and 83.2% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art.

preprint2020arXiv

Mean field theory for deep dropout networks: digging up gradient backpropagation deeply

In recent years, the mean field theory has been applied to the study of neural networks and has achieved a great deal of success. The theory has been applied to various neural network structures, including CNNs, RNNs, Residual networks, and Batch normalization. Inevitably, recent work has also covered the use of dropout. The mean field theory shows that the existence of depth scales that limit the maximum depth of signal propagation and gradient backpropagation. However, the gradient backpropagation is derived under the gradient independence assumption that weights used during feed forward are drawn independently from the ones used in backpropagation. This is not how neural networks are trained in a real setting. Instead, the same weights used in a feed-forward step needs to be carried over to its corresponding backpropagation. Using this realistic condition, we perform theoretical computation on linear dropout networks and a series of experiments on dropout networks. Our empirical results show an interesting phenomenon that the length gradients can backpropagate for a single input and a pair of inputs are governed by the same depth scale. Besides, we study the relationship between variance and mean of statistical metrics of the gradient and shown an emergence of universality. Finally, we investigate the maximum trainable length for deep dropout networks through a series of experiments using MNIST and CIFAR10 and provide a more precise empirical formula that describes the trainable length than original work.

preprint2020arXiv

RGB-IR Cross-modality Person ReID based on Teacher-Student GAN Model

RGB-Infrared (RGB-IR) person re-identification (ReID) is a technology where the system can automatically identify the same person appearing at different parts of a video when light is unavailable. The critical challenge of this task is the cross-modality gap of features under different modalities. To solve this challenge, we proposed a Teacher-Student GAN model (TS-GAN) to adopt different domains and guide the ReID backbone to learn better ReID information. (1) In order to get corresponding RGB-IR image pairs, the RGB-IR Generative Adversarial Network (GAN) was used to generate IR images. (2) To kick-start the training of identities, a ReID Teacher module was trained under IR modality person images, which is then used to guide its Student counterpart in training. (3) Likewise, to better adapt different domain features and enhance model ReID performance, three Teacher-Student loss functions were used. Unlike other GAN based models, the proposed model only needs the backbone module at the test stage, making it more efficient and resource-saving. To showcase our model's capability, we did extensive experiments on the newly-released SYSU-MM01 RGB-IR Re-ID benchmark and achieved superior performance to the state-of-the-art with 49.8% Rank-1 and 47.4% mAP.

preprint2016arXiv

Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm

Unsupervised image segmentation aims at clustering the set of pixels of an image into spatially homogeneous regions. We introduce here a class of Bayesian nonparametric models to address this problem. These models are based on a combination of a Potts-like spatial smoothness component and a prior on partitions which is used to control both the number and size of clusters. This class of models is flexible enough to include the standard Potts model and the more recent Potts-Dirichlet Process model \cite{Orbanz2008}. More importantly, any prior on partitions can be introduced to control the global clustering structure so that it is possible to penalize small or large clusters if necessary. Bayesian computation is carried out using an original generalized Swendsen-Wang algorithm. Experiments demonstrate that our method is competitive in terms of RAND\ index compared to popular image segmentation methods, such as mean-shift, and recent alternative Bayesian nonparametric models.

preprint2016arXiv

Smoothed Hierarchical Dirichlet Process: A Non-Parametric Approach to Constraint Measures

Time-varying mixture densities occur in many scenarios, for example, the distributions of keywords that appear in publications may evolve from year to year, video frame features associated with multiple targets may evolve in a sequence. Any models that realistically cater to this phenomenon must exhibit two important properties: the underlying mixture densities must have an unknown number of mixtures, and there must be some "smoothness" constraints in place for the adjacent mixture densities. The traditional Hierarchical Dirichlet Process (HDP) may be suited to the first property, but certainly not the second. This is due to how each random measure in the lower hierarchies is sampled independent of each other and hence does not facilitate any temporal correlations. To overcome such shortcomings, we proposed a new Smoothed Hierarchical Dirichlet Process (sHDP). The key novelty of this model is that we place a temporal constraint amongst the nearby discrete measures $\{G_j\}$ in the form of symmetric Kullback-Leibler (KL) Divergence with a fixed bound $B$. Although the constraint we place only involves a single scalar value, it nonetheless allows for flexibility in the corresponding successive measures. Remarkably, it also led us to infer the model within the stick-breaking process where the traditional Beta distribution used in stick-breaking is now replaced by a new constraint calculated from $B$. We present the inference algorithm and elaborate on its solutions. Our experiment using NIPS keywords has shown the desirable effect of the model.

preprint2016arXiv

The Dependent Random Measures with Independent Increments in Mixture Models

When observations are organized into groups where commonalties exist amongst them, the dependent random measures can be an ideal choice for modeling. One of the propositions of the dependent random measures is that the atoms of the posterior distribution are shared amongst groups, and hence groups can borrow information from each other. When normalized dependent random measures prior with independent increments are applied, we can derive appropriate exchangeable probability partition function (EPPF), and subsequently also deduce its inference algorithm given any mixture model likelihood. We provide all necessary derivation and solution to this framework. For demonstration, we used mixture of Gaussians likelihood in combination with a dependent structure constructed by linear combinations of CRMs. Our experiments show superior performance when using this framework, where the inferred values including the mixing weights and the number of clusters both respond appropriately to the number of completely random measure used.

preprint2015arXiv

An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data

In the recent years, the desire and need to understand sequential data has been increasing, with particular interest in sequential contexts such as patient monitoring, understanding daily activities, video surveillance, stock market and the like. Along with the constant flow of data, it is critical to classify and segment the observations on-the-fly, without being limited to a rigid number of classes. In addition, the model needs to be capable of updating its parameters to comply with possible evolutions. This interesting problem, however, is not adequately addressed in the literature since many studies focus on offline classification over a pre-defined class set. In this paper, we propose a principled solution to this gap by introducing an adaptive online system based on Markov switching models with hierarchical Dirichlet process priors. This infinite adaptive online approach is capable of segmenting and classifying the sequential data over unlimited number of classes, while meeting the memory and delay constraints of streaming contexts. The model is further enhanced by introducing a learning rate, responsible for balancing the extent to which the model sustains its previous learning (parameters) or adapts to the new streaming observations. Experimental results on several variants of stationary and evolving synthetic data and two video datasets, TUM Assistive Kitchen and collatedWeizmann, show remarkable performance in segmentation and classification, particularly for evolutionary sequences with changing distributions and/or containing new, unseen classes.

preprint2015arXiv

Dependent Indian Buffet Process-based Sparse Nonparametric Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two optimized nonnegative matrices appropriate for the intended applications. The method has been widely used for unsupervised learning tasks, including recommender systems (rating matrix of users by items) and document clustering (weighting matrix of papers by keywords). However, traditional NMF methods typically assume the number of latent factors (i.e., dimensionality of the loading matrices) to be fixed. This assumption makes them inflexible for many applications. In this paper, we propose a nonparametric NMF framework to mitigate this issue by using dependent Indian Buffet Processes (dIBP). In a nutshell, we apply a correlation function for the generation of two stick weights associated with each pair of columns of loading matrices, while still maintaining their respective marginal distribution specified by IBP. As a consequence, the generation of two loading matrices will be column-wise (indirectly) correlated. Under this same framework, two classes of correlation function are proposed (1) using Bivariate beta distribution and (2) using Copula function. Both methods allow us to adopt our work for various applications by flexibly choosing an appropriate parameter settings. Compared with the other state-of-the art approaches in this area, such as using Gaussian Process (GP)-based dIBP, our work is seen to be much more flexible in terms of allowing the two corresponding binary matrix columns to have greater variations in their non-zero entries. Our experiments on the real-world and synthetic datasets show that three proposed models perform well on the document clustering task comparing standard NMF without predefining the dimension for the factor matrices, and the Bivariate beta distribution-based and Copula-based models have better flexibility than the GP-based model.

preprint2015arXiv

Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process

Incorporating the side information of text corpus, i.e., authors, time stamps, and emotional tags, into the traditional text mining models has gained significant interests in the area of information retrieval, statistical natural language processing, and machine learning. One branch of these works is the so-called Author Topic Model (ATM), which incorporates the authors's interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability on fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the multi-author's contribution towards this document. An efficient Gibbs sampling inference algorithm with each conditional distribution being closed-form is developed for the IAT model. Experiments on several real-world datasets show the capabilities of our IAT model to learn the hidden topics, authors' interests on these topics and the number of topics simultaneously.

preprint2015arXiv

Nonparametric Relational Topic Models through Dependent Gamma Processes

Traditional Relational Topic Models provide a way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensional reduction, document clustering, link prediction, benefit from this revealed knowledge. However, existing relational topic models are based on an assumption that the number of hidden topics is known in advance, and this is impractical in many real-world applications. Therefore, in order to relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, a gamma process is assigned to each document, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. In order to resolve these challenges, we use a subsampling strategy to assign each document a different gamma process from the global gamma process, and the subsampling probabilities of documents are assigned with a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and its number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capabilities of learning the hidden topics and, more importantly, the number of topics.

preprint2013arXiv

A non-parametric conditional factor regression model for high-dimensional input and response

In this paper, we propose a non-parametric conditional factor regression (NCFR)model for domains with high-dimensional input and response. NCFR enhances linear regression in two ways: a) introducing low-dimensional latent factors leading to dimensionality reduction and b) integrating an Indian Buffet Process as a prior for the latent factors to derive unlimited sparse dimensions. Experimental results comparing NCRF to several alternatives give evidence to remarkable prediction performance.

preprint2013arXiv

Copula Mixed-Membership Stochastic Blockmodel for Intra-Subgroup Correlations

The \emph{Mixed-Membership Stochastic Blockmodel (MMSB)} is a popular framework for modeling social network relationships. It can fully exploit each individual node's participation (or membership) in a social structure. Despite its powerful representations, this model makes an assumption that the distributions of relational membership indicators between two nodes are independent. Under many social network settings, however, it is possible that certain known subgroups of people may have high or low correlations in terms of their membership categories towards each other, and such prior information should be incorporated into the model. To this end, we introduce a \emph{Copula Mixed-Membership Stochastic Blockmodel (cMMSB)} where an individual Copula function is employed to jointly model the membership pairs of those nodes within the subgroup of interest. The model enables the use of various Copula functions to suit the scenario, while maintaining the membership's marginal distribution, as needed, for modeling membership indicators with other nodes outside of the subgroup of interest. We describe the proposed model and its inference algorithm in detail for both the finite and infinite cases. In the experiment section, we compare our algorithms with other popular models in terms of link prediction, using both synthetic and real world data.

preprint2013arXiv

Dynamic Infinite Mixed-Membership Stochastic Blockmodel

Directional and pairwise measurements are often used to model inter-relationships in a social network setting. The Mixed-Membership Stochastic Blockmodel (MMSB) was a seminal work in this area, and many of its capabilities were extended since then. In this paper, we propose the \emph{Dynamic Infinite Mixed-Membership stochastic blockModel (DIM3)}, a generalised framework that extends the existing work to a potentially infinite number of communities and mixture memberships for each of the network's nodes. This model is in a dynamic setting, where additional model parameters are introduced to reflect the degree of persistence between one's memberships at consecutive times. Accordingly, two effective posterior sampling strategies and their results are presented using both synthetic and real data.

preprint2013arXiv

Learning Hidden Structures with Relational Models by Adequately Involving Rich Information in A Network

Effectively modelling hidden structures in a network is very practical but theoretically challenging. Existing relational models only involve very limited information, namely the binary directional link data, embedded in a network to learn hidden networking structures. There is other rich and meaningful information (e.g., various attributes of entities and more granular information than binary elements such as "like" or "dislike") missed, which play a critical role in forming and understanding relations in a network. In this work, we propose an informative relational model (InfRM) framework to adequately involve rich information and its granularity in a network, including metadata information about each entity and various forms of link data. Firstly, an effective metadata information incorporation method is employed on the prior information from relational models MMSB and LFRM. This is to encourage the entities with similar metadata information to have similar hidden structures. Secondly, we propose various solutions to cater for alternative forms of link data. Substantial efforts have been made towards modelling appropriateness and efficiency, for example, using conjugate priors. We evaluate our framework and its inference algorithms in different datasets, which shows the generality and effectiveness of our models in capturing implicit structures in networks.

Richard Yi Da Xu

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning

Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning

Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective

Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted Attention

Mean field theory for deep dropout networks: digging up gradient backpropagation deeply

RGB-IR Cross-modality Person ReID based on Teacher-Student GAN Model

Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm

Smoothed Hierarchical Dirichlet Process: A Non-Parametric Approach to Constraint Measures

The Dependent Random Measures with Independent Increments in Mixture Models

An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data

Dependent Indian Buffet Process-based Sparse Nonparametric Nonnegative Matrix Factorization

Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process

Nonparametric Relational Topic Models through Dependent Gamma Processes

A non-parametric conditional factor regression model for high-dimensional input and response

Copula Mixed-Membership Stochastic Blockmodel for Intra-Subgroup Correlations

Dynamic Infinite Mixed-Membership Stochastic Blockmodel

Learning Hidden Structures with Relational Models by Adequately Involving Rich Information in A Network