Researcher profile

Johan A. K. Suykens

Johan A. K. Suykens contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2023arXiv

Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer

The success of Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, which is a classical self-supervised task aiming at reordering shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzle has been demonstrated to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving jigsaw puzzle as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Yet simple, we find that Jigsaw-ViT is able to improve both in generalization and robustness over the standard ViT, which is usually rather a trade-off. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M as well as adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.

preprint2022arXiv

Compressing Features for Learning with Noisy Labels

Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper, we focus on the problem of learning with noisy labels and introduce compression inductive bias to network architectures to alleviate this over-fitting problem. More precisely, we revisit one classical regularization named Dropout and its variant Nested Dropout. Dropout can serve as a compression constraint for its feature dropping mechanism, while Nested Dropout further learns ordered feature representations w.r.t. feature importance. Moreover, the trained models with compression regularization are further combined with Co-teaching for performance boost. Theoretically, we conduct bias-variance decomposition of the objective function under compression regularization. We analyze it for both single model and Co-teaching. This decomposition provides three insights: (i) it shows that over-fitting is indeed an issue for learning with noisy labels; (ii) through an information bottleneck formulation, it explains why the proposed feature compression helps in combating label noise; (iii) it gives explanations on the performance boost brought by incorporating compression regularization into Co-teaching. Experiments show that our simple approach can have comparable or even better performance than the state-of-the-art methods on benchmarks with real-world label noise including Clothing1M and ANIMAL-10N. Our implementation is available at https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/.

preprint2022arXiv

Disentangled Representation Learning and Generation with Manifold Optimization

Disentanglement is a useful property in representation learning which increases the interpretability of generative models such as Variational autoencoders (VAE), Generative Adversarial Models, and their many variants. Typically in such models, an increase in disentanglement performance is traded-off with generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variations. The proposed objective is the sum of an autoencoder error term along with a Principal Component Analysis reconstruction error in the feature space. This has an interpretation of a Restricted Kernel Machine with the eigenvector matrix-valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use Cayley ADAM algorithm - a stochastic optimization method on the Stiefel manifold along with the ADAM optimizer. Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.

preprint2022arXiv

Learning with Asymmetric Kernels: Least Squares and Feature Interpretation

Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses the asymmetric kernel-based learning in the framework of the least squares support vector machine named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that in the case asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.

preprint2022arXiv

Piecewise Linear Neural Networks and Deep Learning

As a powerful modelling method, PieceWise Linear Neural Networks (PWLNNs) have proven successful in various fields, most recently in deep learning. To apply PWLNN methods, both the representation and the learning have long been studied. In 1977, the canonical representation pioneered the works of shallow PWLNNs learned by incremental designs, but the applications to large-scale data were prohibited. In 2010, the Rectified Linear Unit (ReLU) advocated the prevalence of PWLNNs in deep learning. Ever since, PWLNNs have been successfully applied to extensive tasks and achieved advantageous performances. In this Primer, we systematically introduce the methodology of PWLNNs by grouping the works into shallow and deep networks. Firstly, different PWLNN representation models are constructed with elaborated examples. With PWLNNs, the evolution of learning algorithms for data is presented and fundamental theoretical analysis follows up for in-depth understandings. Then, representative applications are introduced together with discussions and outlooks.

preprint2021arXiv

Determinantal Point Processes Implicitly Regularize Semi-parametric Regression Problems

Semi-parametric regression models are used in several applications which require comprehensibility without sacrificing accuracy. Typical examples are spline interpolation in geophysics, or non-linear time series problems, where the system includes a linear and non-linear component. We discuss here the use of a finite Determinantal Point Process (DPP) for approximating semi-parametric models. Recently, Barthelmé, Tremblay, Usevich, and Amblard introduced a novel representation of some finite DPPs. These authors formulated extended L-ensembles that can conveniently represent partial-projection DPPs and suggest their use for optimal interpolation. With the help of this formalism, we derive a key identity illustrating the implicit regularization effect of determinantal sampling for semi-parametric regression and interpolation. Also, a novel projected Nyström approximation is defined and used to derive a bound on the expected risk for the corresponding approximation of semi-parametric regression. This work naturally extends similar results obtained for kernel ridge regression.

preprint2021arXiv

Fast Learning in Reproducing Kernel Krein Spaces via Signed Measures

In this paper, we attempt to solve a long-lasting open question for non-positive definite (non-PD) kernels in machine learning community: can a given non-PD kernel be decomposed into the difference of two PD kernels (termed as positive decomposition)? We cast this question as a distribution view by introducing the \emph{signed measure}, which transforms positive decomposition to measure decomposition: a series of non-PD kernels can be associated with the linear combination of specific finite Borel measures. In this manner, our distribution-based framework provides a sufficient and necessary condition to answer this open question. Specifically, this solution is also computationally implementable in practice to scale non-PD kernels in large sample cases, which allows us to devise the first random features algorithm to obtain an unbiased estimator. Experimental results on several benchmark datasets verify the effectiveness of our algorithm over the existing methods.

preprint2021arXiv

Kernel regression in high dimensions: Refined analysis beyond double descent

In this paper, we provide a precise characterization of generalization properties of high dimensional kernel ridge regression across the under- and over-parameterized regimes, depending on whether the number of training data n exceeds the feature dimension d. By establishing a bias-variance decomposition of the expected excess risk, we show that, while the bias is (almost) independent of d and monotonically decreases with n, the variance depends on n, d and can be unimodal or monotonically decreasing under different regularization schemes. Our refined analysis goes beyond the double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of n. Experiments on synthetic and real data are conducted to support our theoretical findings.

preprint2021arXiv

Unsupervised Energy-based Out-of-distribution Detection using Stiefel-Restricted Kernel Machine

Detecting out-of-distribution (OOD) samples is an essential requirement for the deployment of machine learning systems in the real world. Until now, research on energy-based OOD detectors has focused on the softmax confidence score from a pre-trained neural network classifier with access to class labels. In contrast, we propose an unsupervised energy-based OOD detector leveraging the Stiefel-Restricted Kernel Machine (St-RKM). Training requires minimizing an objective function with an autoencoder loss term and the RKM energy where the interconnection matrix lies on the Stiefel manifold. Further, we outline multiple energy function definitions based on the RKM framework and discuss their utility. In the experiments on standard datasets, the proposed method improves over the existing energy-based OOD detectors and deep generative models. Through several ablation studies, we further illustrate the merit of each proposed energy function on the OOD detection performance.

preprint2020arXiv

A Statistical Learning Approach to Modal Regression

This paper studies the nonparametric modal regression problem systematically from a statistical learning view. Originally motivated by pursuing a theoretical understanding of the maximum correntropy criterion based regression (MCCR), our study reveals that MCCR with a tending-to-zero scale parameter is essentially modal regression. We show that nonparametric modal regression problem can be approached via the classical empirical risk minimization. Some efforts are then made to develop a framework for analyzing and implementing modal regression. For instance, the modal regression function is described, the modal regression risk is defined explicitly and its \textit{Bayes} rule is characterized; for the sake of computational tractability, the surrogate modal regression risk, which is termed as the generalization risk in our study, is introduced. On the theoretical side, the excess modal regression risk, the excess generalization risk, the function estimation error, and the relations among the above three quantities are studied rigorously. It turns out that under mild conditions, function estimation consistency and convergence may be pursued in modal regression as in vanilla regression protocols, such as mean regression, median regression, and quantile regression. However, it outperforms these regression models in terms of robustness as shown in our study from a re-descending M-estimation view. This coincides with and in return explains the merits of MCCR on robustness. On the practical side, the implementation issues of modal regression including the computational algorithm and the tuning parameters selection are discussed. Numerical assessments on modal regression are also conducted to verify our findings empirically.

preprint2020arXiv

Diversity sampling is an implicit regularization for kernel methods

Kernel methods have achieved very good performance on large scale regression and classification problems, by using the Nyström method and preconditioning techniques. The Nyström approximation -- based on a subset of landmarks -- gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We further elaborate on the impact of sampling diverse landmarks for constructing the Nyström approximation in supervised as well as unsupervised kernel methods. By using Determinantal Point Processes for sampling, we obtain additional theoretical results concerning the interplay between diversity and regularization. Empirically, we demonstrate the advantages of training kernel methods based on subsets made of diverse points. In particular, if the dataset has a dense bulk and a sparser tail, we show that Nyström kernel regression with diverse landmarks increases the accuracy of the regression in sparser regions of the dataset, with respect to a uniform landmark sampling. A greedy heuristic is also proposed to select diverse samples of significant size within large datasets when exact DPP sampling is not practically feasible.

preprint2020arXiv

Indefinite Kernel Logistic Regression with Concave-inexact-convex Procedure

In kernel methods, the kernels are often required to be positive definite, which restricts the use of many indefinite kernels. To consider those non-positive definite kernels, in this paper, we aim to build an indefinite kernel learning framework for kernel logistic regression. The proposed indefinite kernel logistic regression (IKLR) model is analysed in the Reproducing Kernel Kre\uın Spaces (RKKS) and then becomes non-convex. Using the positive decomposition of a non-positive definite kernel, the derived IKLR model can be decomposed into the difference of two convex functions. Accordingly, a concave-convex procedure is introduced to solve the non-convex optimization problem. Since the concave-convex procedure has to solve a sub-problem in each iteration, we propose a concave-inexact-convex procedure (CCICP) algorithm with an inexact solving scheme to accelerate the solving process. Besides, we propose a stochastic variant of CCICP to efficiently obtain a proximal solution, which achieves the similar purpose with the inexact solving scheme in CCICP. The convergence analyses of the above two variants of concave-convex procedure are conducted. By doing so, our method works effectively not only under a deterministic setting but also under a stochastic setting. Experimental results on several benchmarks suggest that the proposed IKLR model performs favorably against the standard (positive-definite) kernel logistic regression and other competitive indefinite learning based algorithms.

preprint2020arXiv

Robust Generative Restricted Kernel Machines using Weighted Conjugate Feature Duality

Interest in generative models has grown tremendously in the past decade. However, their training performance can be adversely affected by contamination, where outliers are encoded in the representation of the model. This results in the generation of noisy data. In this paper, we introduce weighted conjugate feature duality in the framework of Restricted Kernel Machines (RKMs). The RKM formulation allows for an easy integration of methods from classical robust statistics. This formulation is used to fine-tune the latent space of generative RKMs using a weighting function based on the Minimum Covariance Determinant, which is a highly robust estimator of multivariate location and scatter. Experiments show that the weighted RKM is capable of generating clean images when contamination is present in the training data. We further show that the robust method also preserves uncorrelated feature learning through qualitative and quantitative experiments on standard datasets.

preprint2020arXiv

Wasserstein Exponential Kernels

In the context of kernel methods, the similarity between data points is encoded by the kernel function which is often defined thanks to the Euclidean distance, a common example being the squared exponential kernel. Recently, other distances relying on optimal transport theory - such as the Wasserstein distance between probability distributions - have shown their practical relevance for different machine learning techniques. In this paper, we study the use of exponential kernels defined thanks to the regularized Wasserstein distance and discuss their positive definiteness. More specifically, we define Wasserstein feature maps and illustrate their interest for supervised learning problems involving shapes and images. Empirically, Wasserstein squared exponential kernels are shown to yield smaller classification errors on small training sets of shapes, compared to analogous classifiers using Euclidean distances.