Researcher profile

Jose C. Principe

Jose C. Principe contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging - Verification L1 - Unclaimed author
17 works
0 followers
10 topics
4 close collaborators

Published work

17 published items

preprint - 2023 - arXiv

Labels, Information, and Computation: Efficient Learning Using Sufficient Labels

In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. We also prove that it is easier to obtain sufficiently-labeled data directly than to obtain fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.

preprint - 2022 - arXiv

A Physics inspired Functional Operator for Model Uncertainty Quantification in the RKHS

Accurate uncertainty quantification of model predictions is a crucial problem in machine learning. Existing Bayesian methods, being highly iterative, are expensive to implement and often fail to accurately capture a model's true posterior because of their tendency to select only central moments. We propose a fast single-shot uncertainty quantification framework where, instead of working with the conventional Bayesian definition of model weight probability density function (PDF), we utilize physics inspired functional operators over the projection of model weights in a reproducing kernel Hilbert space (RKHS) to quantify their uncertainty at each model output. The RKHS projection of model weights yields a potential field based interpretation of model weight PDF which consequently allows the definition of a functional operator, inspired by perturbation theory in physics, that performs a moment decomposition of the model weight PDF (the potential field) at a specific model output to quantify its uncertainty. We call this representation of the model weight PDF the quantum information potential field (QIPF) of the weights. The extracted moments from this approach automatically decompose the weight PDF in the local neighborhood of the specified model output and determine, with great sensitivity, the local heterogeneity of the weight PDF around a given prediction. These moments therefore provide sharper estimates of predictive uncertainty than central stochastic moments of Bayesian methods. Experiments evaluating the error detection capability of different uncertainty quantification methods on covariate shifted test data show our approach to be more precise and better calibrated than baseline methods, while being faster to compute.

preprint - 2022 - arXiv

Adapting the Exploration Rate for Value-of-Information-Based Reinforcement Learning

In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process, like that it yields a hierarchy of state abstractions. We also show that our approach returns better policies in fewer episodes than conventional search strategies relying on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance either near or well above the level of human play is observed.

preprint - 2022 - arXiv

Deep Deterministic Independent Component Analysis for Hyperspectral Unmixing

We develop a new neural network based independent component analysis (ICA) method by directly minimizing the dependence amongst all extracted components. Using the matrix-based Rényi's $\alpha$-order entropy functional, our network can be directly optimized by stochastic gradient descent (SGD), without any variational approximation or adversarial training. As a solid application, we evaluate our ICA in the problem of hyperspectral unmixing (HU) and refute a statement that "ICA does not play a role in unmixing hyperspectral data", which was initially suggested by Nascimento and Dias (2005). Code and additional remarks on our DDICA are available at https://github.com/hongmingli1995/DDICA.

preprint - 2022 - arXiv

Information Theoretic Structured Generative Modeling

Rényi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that Rényi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing fewer restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on Rényi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks.

preprint - 2022 - arXiv

Principle of Relevant Information for Graph Sparsification

Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs

preprint - 2022 - arXiv

Target Detection and Segmentation in Circular-Scan Synthetic-Aperture-Sonar Images using Semi-Supervised Convolutional Encoder-Decoders

We propose a framework for saliency-based, multi-target detection and segmentation of circular-scan, synthetic-aperture-sonar (CSAS) imagery. Our framework relies on a multi-branch, convolutional encoder-decoder network (MB-CEDN). The encoder portion of the MB-CEDN extracts visual contrast features from CSAS images. These features are fed into dual decoders that perform pixel-level segmentation to mask targets. Each decoder provides different perspectives as to what constitutes a salient target. These opinions are aggregated and cascaded into a deep-parsing network to refine the segmentation. We evaluate our framework using real-world CSAS imagery consisting of five broad target classes. We compare against existing approaches from the computer-vision literature. We show that our framework outperforms supervised, deep-saliency networks designed for natural imagery. It greatly outperforms unsupervised saliency approaches developed for natural imagery. This illustrates that natural-image-based models may need to be altered to be effective for this imaging-sonar modality.

preprint - 2022 - arXiv

The Functional Wiener Filter

This paper presents a closed-form solution in Reproducing Kernel Hilbert Space (RKHS) for the famed Wiener filter, which we call the functional Wiener filter (FWF). Instead of using the Wiener-Hopf factorization theory, here we define a new lagged RKHS that embeds signal statistics based on the correntropy function. In essence, we extend Parzen's work on the autocorrelation function RKHS to nonlinear functional spaces. The FWF derivation is also quite different from kernel adaptive filtering (KAF) algorithms, which utilize a search approach. The analytic FWF solution is derived in the Gaussian kernel RKHS with a constant computational complexity similar to the Wiener solution, and never composes nor employs the error as in conventional optimal modeling. Because of the lack of congruence between the Gaussian RKHS and the space of time series, we compare performance of two pre-imaging algorithms: a fixed-point optimization (FWFFP) that finds an approximate solution in the RKHS, and a local model implementation named FWFLM. The experimental results show that the FWF performance is on par with the KAF for time series modeling, and it requires far less computation.

preprint - 2022 - arXiv

Training Deep Architectures Without End-to-End Backpropagation: A Survey on the Provably Optimal Methods

This tutorial paper surveys provably optimal alternatives to end-to-end backpropagation (E2EBP) -- the de facto standard for training deep architectures. Modular training refers to strictly local training without both the forward and the backward pass, i.e., dividing a deep architecture into several nonoverlapping modules and training them separately without any end-to-end operation. Between the fully global E2EBP and the strictly local modular training, there are weakly modular hybrids performing training without the backward pass only. These alternatives can match or surpass the performance of E2EBP on challenging datasets such as ImageNet, and are gaining increasing attention primarily because they offer practical advantages over E2EBP, which will be enumerated herein. In particular, they allow for greater modularity and transparency in deep learning workflows, aligning deep learning with the mainstream computer science engineering that heavily exploits modularization for scalability. Modular training has also revealed novel insights about learning and has further implications on other important research domains. Specifically, it induces natural and effective solutions to some important practical problems such as data efficiency and transferability estimation.

preprint - 2021 - arXiv

A Kernel Framework to Quantify a Model's Local Predictive Uncertainty under Data Distributional Shifts

Traditional Bayesian approaches for model uncertainty quantification rely on notoriously difficult processes of marginalization over each network parameter to estimate its probability density function (PDF). Our hypothesis is that internal layer outputs of a trained neural network contain all of the information related to both its mapping function (quantified by its weights) as well as the input data distribution. We therefore propose a framework for predictive uncertainty quantification of a trained neural network that explicitly estimates the PDF of its raw prediction space (before activation), p(y'|x,w), which we refer to as the model PDF, in a Gaussian reproducing kernel Hilbert space (RKHS). The Gaussian RKHS provides a localized density estimate of p(y'|x,w), which further enables us to utilize gradient based formulations of quantum physics to decompose the model PDF in terms of multiple local uncertainty moments that provide much greater resolution of the PDF than the central moments characterized by Bayesian methods. This provides the framework with a better ability to detect distributional shifts in test data away from the training data PDF learned by the model. We evaluate the framework against existing uncertainty quantification methods on benchmark datasets that have been corrupted using common perturbation techniques. The kernel framework is observed to provide model uncertainty estimates with much greater precision based on the ability to detect model prediction errors.

preprint - 2021 - arXiv

Deep Deterministic Information Bottleneck with Matrix-based Entropy Functional

We introduce the matrix-based Rényi's $\alpha$-order entropy functional to parameterize the information bottleneck (IB) principle of Tishby et al. with a neural network. We term our methodology Deep Deterministic Information Bottleneck (DIB), as it avoids variational inference and distribution assumption. We show that deep neural networks trained with DIB outperform the variational objective counterpart and those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack. Code is available at https://github.com/yuxi120407/DIB

preprint - 2021 - arXiv

Interpretable Fault Detection using Projections of Mutual Information Matrix

This paper presents a novel mutual information (MI) matrix based method for fault detection. Given an $m$-dimensional fault process, the MI matrix is an $m \times m$ matrix in which the $(i,j)$-th entry measures the MI values between the $i$-th dimension and the $j$-th dimension variables. We introduce the recently proposed matrix-based Rényi's $\alpha$-entropy functional to estimate MI values in each entry of the MI matrix. The new estimator avoids density estimation and it operates on the eigenspectrum of a (normalized) symmetric positive definite (SPD) matrix, which makes it well suited for industrial processes. We combine different orders of statistics of the transformed components (TCs) extracted from the MI matrix to constitute the detection index, and derive a simple similarity index to monitor the changes of characteristics of the underlying process in consecutive windows. We term the overall methodology "projections of mutual information matrix" (PMIM). Experiments on both synthetic data and the benchmark Tennessee Eastman process demonstrate the interpretability of PMIM in identifying the root variables that cause the faults, and its superiority in detecting the occurrence of faults in terms of the improved fault detection rate (FDR) and the lowest false alarm rate (FAR). PMIM also has the advantage of being less sensitive to hyper-parameters. Code of PMIM is available at https://github.com/SJYuCNEL/Fault_detection_PMIM
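
As an illustration of the MI matrix described in this abstract, the following is a minimal sketch, not the authors' released implementation. It assumes a Gaussian kernel with a hand-picked width `sigma`, uses `alpha = 2` as a default, and forms the joint term from the Hadamard product of the two normalized Gram matrices, which is the usual construction for the matrix-based Rényi functional. The function names (`gram_matrix`, `renyi_entropy`, `mutual_information`, `mi_matrix`) are hypothetical and chosen for this example only.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Trace-normalized Gaussian Gram matrix for samples x of shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi alpha-entropy (in bits) from the eigenspectrum of a."""
    eigvals = np.linalg.eigvalsh(a)
    eigvals = np.clip(eigvals, 0.0, None)  # drop tiny negatives caused by round-off
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

def mutual_information(a, b, alpha=2.0):
    """S(A) + S(B) - S(A, B), with the joint built from the Hadamard product."""
    ab = a * b
    ab = ab / np.trace(ab)
    return renyi_entropy(a, alpha) + renyi_entropy(b, alpha) - renyi_entropy(ab, alpha)

def mi_matrix(data, sigma=1.0, alpha=2.0):
    """m x m matrix whose (i, j) entry estimates the MI between columns i and j."""
    data = np.asarray(data, dtype=float)
    m = data.shape[1]
    grams = [gram_matrix(data[:, i], sigma) for i in range(m)]
    out = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            out[i, j] = out[j, i] = mutual_information(grams[i], grams[j], alpha)
    return out
```

Calling `mi_matrix(window)` on an array of shape `(n_samples, m)` returns a symmetric `m x m` matrix. No density is estimated at any point; only eigenvalues of Gram matrices are used. The detection index and similarity index mentioned in the abstract are not reproduced here.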

preprint - 2021 - arXiv

Measuring Dependence with Matrix-based Entropy Functional

Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective through Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation ($T_\alpha^*$) and the matrix-based normalized dual total correlation ($D_\alpha^*$), to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely the gene regulatory network inference, the robust machine learning under covariate shift and non-Gaussian noises, the subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks (CNNs), to demonstrate their utilities, advantages, as well as implications to those problems. Code of our dependence measure is available at: https://bit.ly/AAAI-dependence

preprint - 2020 - arXiv

Fast Estimation of Information Theoretic Learning Descriptors using Explicit Inner Product Spaces

Kernel methods form a theoretically-grounded, powerful and versatile framework to solve nonlinear problems in signal processing and machine learning. The standard approach relies on the kernel trick to perform pairwise evaluations of a kernel function, leading to scalability issues for large datasets due to its linear and superlinear growth with respect to the training data. Recently, we proposed no-trick (NT) kernel adaptive filtering (KAF) that leverages explicit feature space mappings using data-independent basis with constant complexity. The inner product defined by the feature mapping corresponds to a positive-definite finite-rank kernel that induces a finite-dimensional reproducing kernel Hilbert space (RKHS). Information theoretic learning (ITL) is a framework where information theory descriptors based on non-parametric estimator of Renyi entropy replace conventional second-order statistics for the design of adaptive systems. An RKHS for ITL defined on a space of probability density functions simplifies statistical inference for supervised or unsupervised learning. ITL criteria take into account the higher-order statistical behavior of the systems and signals as desired. However, this comes at a cost of increased computational complexity. In this paper, we extend the NT kernel concept to ITL for improved information extraction from the signal without compromising scalability. Specifically, we focus on a family of fast, scalable, and accurate estimators for ITL using explicit inner product space (EIPS) kernels. We demonstrate the superior performance of EIPS-ITL estimators and combined NT-KAF using EIPS-ITL cost functions through experiments.
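
To make the scalability concern in this abstract concrete, the sketch below shows the standard pairwise (kernel-trick) estimator of Rényi's quadratic entropy used in ITL, often called the information potential. It is the baseline whose cost grows quadratically with the number of samples, not the EIPS estimator proposed in the paper. It assumes one-dimensional samples and a Gaussian Parzen window of width `sigma`; the function names are hypothetical.

```python
import numpy as np

def information_potential(x, sigma=1.0):
    """Mean of a Gaussian kernel over all sample pairs (N x N evaluations)."""
    x = np.asarray(x, dtype=float).ravel()
    diffs = x[:, None] - x[None, :]
    # Convolving two Parzen windows of variance sigma^2 gives variance 2 * sigma^2.
    s2 = 2.0 * sigma ** 2
    kernel = np.exp(-diffs ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return kernel.mean()

def quadratic_entropy(x, sigma=1.0):
    """Renyi's quadratic entropy estimate: minus the log of the information potential."""
    return -np.log(information_potential(x, sigma))
```

The `diffs` array holds one entry per sample pair, so memory and time both scale as N squared. That growth is what motivates replacing pairwise kernel evaluations with explicit, finite-dimensional feature maps.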

preprint - 2020 - arXiv

Measuring the Discrepancy between Conditional Distributions: Methods, Properties and Applications

We propose a simple yet powerful test statistic to quantify the discrepancy between two conditional distributions. The new statistic avoids the explicit estimation of the underlying distributions in high-dimensional space and it operates on the cone of symmetric positive semidefinite (SPS) matrices using the Bregman matrix divergence. Moreover, it inherits the merits of the correntropy function to explicitly incorporate high-order statistics in the data. We present the properties of our new statistic and illustrate its connections to prior art. We finally show the applications of our new statistic on three different machine learning problems, namely the multi-task learning over graphs, the concept drift detection, and the information-theoretic feature selection, to demonstrate its utility and advantage. Code of our statistic is available at https://bit.ly/BregmanCorrentropy.

preprint - 2020 - arXiv

PRI-VAE: Principle-of-Relevant-Information Variational Autoencoders

Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties of the learning dynamics of most VAE models remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective to analyze existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark data sets.

preprint - 2020 - arXiv

Understanding Convolutional Neural Networks with Information Theory: An Initial Exploration

The matrix-based Renyi's α-entropy functional and its multivariate extension were recently developed in terms of the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the utility and possible applications of these new estimators are rather new and mostly unknown to practitioners. In this paper, we first show that our estimators enable straightforward measurement of information flow in realistic convolutional neural networks (CNN) without any approximation. Then, we introduce the partial information decomposition (PID) framework and develop three quantities to analyze the synergy and redundancy in convolutional layer representations. Our results validate two fundamental data processing inequalities and reveal some fundamental properties concerning the training of CNN.