Source author record

Thuan Nguyen

Thuan Nguyen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Information Theory, math.IT, eess.SP, Information Retrieval, Machine Learning, Artificial Intelligence, Methodology, Computation, Computer Vision

Catalog footprint

What is connected

12 works

9 topics

4 close collaborators

preprint, 2026, arXiv

Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.
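
The balanced-case geometry mentioned in this abstract, class means forming an Equiangular Tight Frame, can be illustrated with a minimal sketch. It builds the standard simplex ETF by centering and rescaling the standard basis; the function names `simplex_etf` and `dot` and the choice K = 4 are illustrative assumptions, not taken from the paper.

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k that form a simplex equiangular tight frame."""
    # Row i is the centered basis vector e_i - (1/k) * ones, rescaled to unit norm.
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

K = 4  # hypothetical number of balanced classes
means = simplex_etf(K)
gram = [[dot(u, v) for v in means] for u in means]

# Diagonal entries are 1 (unit norm); every off-diagonal entry is -1/(K-1).
for row in gram:
    print(["%.3f" % g for g in row])
```

All pairs of distinct class means share the same inner product, -1/(K-1), which is the "equiangular" property. The imbalanced-case geometry characterized in the paper is not reproduced here.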

preprint, 2025, arXiv

A Random-Effects Approach to Generalized Linear Mixed Model Analysis of Incomplete Longitudinal Data

We propose a random-effects approach to missing values for generalized linear mixed model (GLMM) analysis. The method converts a GLMM with missing covariates to another GLMM without missing covariates. The standard GLMM analysis tools for longitudinal data then apply. The method applies, in particular, to the cases of linear mixed models and logistic regression. The performance of the method is evaluated empirically and compared with alternative approaches, including the popular MICE procedure for multiple imputation. A theoretical justification of the method is given, together with an explanation of the patterns observed in the simulation studies. Two real-data examples from healthcare studies are discussed.

preprint, 2022, arXiv

Conditional entropy minimization principle for learning domain invariant representation features

Invariance-principle-based methods, such as Invariant Risk Minimization (IRM), have recently emerged as promising approaches for Domain Generalization (DG). Despite promising theory, such approaches fail in common classification tasks due to the mixing of true invariant features and spurious invariant features. To address this, we propose a framework based on the conditional entropy minimization (CEM) principle to filter out the spurious invariant features, leading to a new algorithm with better generalization capability. We show that our proposed approach is closely related to the well-known Information Bottleneck (IB) framework and prove that, under certain assumptions, entropy minimization can exactly recover the true invariant features. Our approach provides competitive classification accuracy compared to recent theoretically principled state-of-the-art alternatives across several DG datasets.

preprint, 2022, arXiv

Joint covariate-alignment and concept-alignment: a framework for domain generalization

In this paper, we propose a novel domain generalization (DG) framework based on a new upper bound on the risk on the unseen domain. In particular, our framework proposes to jointly minimize both the covariate shift and the concept shift between the seen domains for better performance on the unseen domain. While the proposed approach can be implemented via an arbitrary combination of covariate-alignment and concept-alignment modules, in this work we use well-established approaches for distributional alignment, namely Maximum Mean Discrepancy (MMD) and Correlation Alignment (CORAL), and use an Invariant Risk Minimization (IRM)-based approach for concept alignment. Our numerical results show that the proposed methods perform as well as or better than the state of the art for domain generalization on several data sets.
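
As a rough illustration of the covariate-alignment ingredient named in this abstract, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between two small samples with an RBF kernel. The kernel choice, the bandwidth parameter `gamma`, and the toy data are assumptions made for illustration; the paper's actual training objective is not reproduced.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Hypothetical feature samples from two seen domains.
domain_a = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
domain_b = [(x + 2.0, y + 2.0) for x, y in domain_a]  # shifted copy of domain_a

print(mmd2(domain_a, domain_a))  # identical samples give zero discrepancy
print(mmd2(domain_a, domain_b))  # shifted samples give a positive discrepancy
```

In an alignment module, a quantity of this kind would typically be added to the training loss as a penalty on the learned features.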

preprint, 2020, arXiv

Communication-Channel Optimized Partition

Consider an original discrete source X with distribution p_X that is corrupted by noise to produce noisy data Y with a given joint distribution p(X, Y). A quantizer/classifier Q : Y -> Z is then used to classify/quantize the data Y to the discrete partitioned output Z with probability distribution p_Z. Next, Z is transmitted over a deterministic channel with a given channel matrix A that produces the final discrete output T. One wants to design the optimal quantizer/classifier Q^* such that the cost function F(X; T) between the input X and the final output T is minimized while the probability of the partitioned output Z satisfies a concave constraint G(p_Z) < C. Our results generalize some well-known previous results. First, an iterative linear-time algorithm is proposed to find a locally optimal quantizer. Second, we show that the optimal partition should produce a hard partition that is equivalent to the cuts by hyperplanes in the probability space of the posterior probability p(X|Y). This result finally provides a polynomial-time algorithm to find the globally optimal quantizer.

preprint, 2020, arXiv

Entropy-Constrained Maximizing Mutual Information Quantization

In this paper, we investigate the quantization of the output of a binary-input discrete memoryless channel that maximizes the mutual information between the input and the quantized output under an entropy constraint on the quantized output. A polynomial-time algorithm is introduced that can find the globally optimal quantizer. These results hold for binary-input channels with an arbitrary number of quantized outputs. Finally, we extend these results to binary-input continuous-output channels and show a sufficient condition under which a single-threshold quantizer is an optimal quantizer. Both theoretical and numerical results are provided to justify our techniques.
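
The problem stated in this abstract can be made concrete with a brute-force sketch: enumerate every deterministic quantizer of a tiny channel output alphabet and keep the one with the largest I(X;Z) among those whose output entropy H(Z) stays under a cap. This exhaustive search is exponential in the alphabet size and is not the polynomial-time algorithm of the paper; the joint distribution and the names `evaluate` and `best_quantizer` are hypothetical.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def evaluate(joint, assign, n_out):
    """Return (I(X;Z), H(Z)) for the deterministic quantizer y -> assign[y]."""
    n_x = len(joint)
    pxz = [[0.0] * n_out for _ in range(n_x)]
    for x in range(n_x):
        for y, z in enumerate(assign):
            pxz[x][z] += joint[x][y]
    px = [sum(row) for row in pxz]
    pz = [sum(pxz[x][z] for x in range(n_x)) for z in range(n_out)]
    h_xz = entropy([v for row in pxz for v in row])
    return entropy(px) + entropy(pz) - h_xz, entropy(pz)

def best_quantizer(joint, n_out, max_entropy):
    """Exhaustively search quantizers with H(Z) <= max_entropy."""
    best = (-1.0, None)
    for assign in product(range(n_out), repeat=len(joint[0])):
        mi, hz = evaluate(joint, assign, n_out)
        if hz <= max_entropy + 1e-12 and mi > best[0]:
            best = (mi, assign)
    return best

# Hypothetical joint distribution p(x, y): binary input, four channel outputs.
joint = [[0.300, 0.100, 0.075, 0.025],
         [0.025, 0.075, 0.100, 0.300]]

mi_free, q_free = best_quantizer(joint, 2, max_entropy=1.0)   # constraint inactive
mi_tight, q_tight = best_quantizer(joint, 2, max_entropy=0.0) # forces constant Z
print(mi_free, q_free)
print(mi_tight, q_tight)
```

Tightening the entropy cap can only shrink the feasible set, so the achievable mutual information is non-decreasing in `max_entropy`; a cap of zero bits leaves only constant quantizers, which carry no information about X.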

preprint, 2020, arXiv

On Bounds and Closed Form Expressions for Capacities of Discrete Memoryless Channels with Invertible Positive Matrices

While capacities of discrete memoryless channels are well studied, it is still not possible to obtain a closed-form expression for the capacity of an arbitrary discrete memoryless channel. This paper describes an elementary technique based on the Karush-Kuhn-Tucker (KKT) conditions to obtain (1) a good upper bound on the capacity of a discrete memoryless channel having an invertible positive channel matrix and (2) a closed-form expression for the capacity if the channel matrix satisfies certain conditions related to its singular value and its Gershgorin disk.
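
For comparison with closed-form expressions or bounds of the kind described in this abstract, the capacity of a small channel can be approximated numerically. The sketch below uses the classical Blahut-Arimoto iteration, which is a different technique from the KKT-based one in the paper; the binary symmetric channel with crossover probability 0.1 is an assumed test case.

```python
import math

def blahut_arimoto(W, iters=200):
    """Approximate the capacity (bits) of a DMC with W[x][y] = p(y|x)."""
    n_in, n_out = len(W), len(W[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution

    def output_dist(p):
        return [sum(p[x] * W[x][y] for x in range(n_in)) for y in range(n_out)]

    for _ in range(iters):
        q = output_dist(p)
        # d[x] = exp(KL(W[x] || q)), with the divergence computed in nats
        d = [math.exp(sum(W[x][y] * math.log(W[x][y] / q[y])
                          for y in range(n_out) if W[x][y] > 0))
             for x in range(n_in)]
        z = sum(p[x] * d[x] for x in range(n_in))
        p = [p[x] * d[x] / z for x in range(n_in)]  # multiplicative update

    # Mutual information at the final input distribution, in bits.
    q = output_dist(p)
    capacity = sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
                   for x in range(n_in) for y in range(n_out)
                   if p[x] > 0 and W[x][y] > 0)
    return capacity, p

bsc = [[0.9, 0.1], [0.1, 0.9]]  # invertible, strictly positive channel matrix
cap, p_star = blahut_arimoto(bsc)
print(cap, p_star)
```

For the binary symmetric channel the result can be checked against the textbook formula 1 - H(0.1), where H is the binary entropy function.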

preprint, 2020, arXiv

On the Uniqueness of Binary Quantizers for Maximizing Mutual Information

We consider a channel with a binary input X corrupted by continuous-valued noise that results in a continuous-valued output Y. An optimal binary quantizer is used to quantize the continuous-valued output Y to the final binary output Z to maximize the mutual information I(X; Z). We show that when the ratio of the channel conditional densities r(y) = P(Y=y|X=0) / P(Y=y|X=1) is a strictly increasing or strictly decreasing function of y, a quantizer having a single threshold can maximize the mutual information. Furthermore, we show that an optimal quantizer (possibly with multiple thresholds) is the one with the thresholding vector whose elements are all the solutions of r(y) = r* for some constant r* > 0. Interestingly, the optimal constant r* is unique. This uniqueness property allows for fast algorithmic implementations, such as a bisection algorithm, to find the optimal quantizer. Our results also confirm some previous results using alternative elementary proofs. We show some numerical examples of applying our results to channels with additive Gaussian noise.
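
A minimal numerical sketch of the single-threshold case follows, assuming equiprobable inputs mapped to means -1 and +1 with unit-variance Gaussian noise, for which the likelihood ratio is monotone in y. It scans a grid of thresholds rather than using the bisection approach mentioned in the abstract; all parameter values and function names are illustrative assumptions.

```python
import math

def gauss_cdf(t, mu, sigma):
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mutual_information(t, p0=0.5, mu0=-1.0, mu1=1.0, sigma=1.0):
    """I(X;Z) in bits for Z = 1{Y > t}, where Y = mu_X + N(0, sigma^2)."""
    a0 = gauss_cdf(t, mu0, sigma)  # P(Z=0 | X=0)
    a1 = gauss_cdf(t, mu1, sigma)  # P(Z=0 | X=1)
    pz0 = p0 * a0 + (1.0 - p0) * a1
    return h2(pz0) - p0 * h2(a0) - (1.0 - p0) * h2(a1)  # H(Z) - H(Z|X)

def best_threshold(lo=-5.0, hi=5.0, steps=2001, **channel):
    """Grid search for the threshold that maximizes I(X;Z)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: mutual_information(t, **channel))

t_star = best_threshold()
print(t_star, mutual_information(t_star))
```

In this symmetric setting the search settles at a threshold of zero, as symmetry suggests; unequal priors (for example `best_threshold(p0=0.8)`) move the threshold away from the midpoint.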

preprint, 2020, arXiv

Optimal quantizer structure for binary discrete input continuous output channels under an arbitrary quantized-output constraint

Consider a channel with binary input X = (x_1, x_2) and probability distribution p_X = (p_{x_1}, p_{x_2}) that is corrupted by continuous noise to produce a continuous output y \in Y = R. For given conditional distributions p(y|x_1) = ϕ_1(y) and p(y|x_2) = ϕ_2(y), one wants to quantize the continuous output y back to the final discrete output Z = (z_1, z_2, ..., z_N) with N \leq 2 such that the mutual information between the input and the quantized output I(X; Z) is maximized while the probability of the quantized output p_Z = (p_{z_1}, p_{z_2}, ..., p_{z_N}) satisfies a certain constraint. Defining a new variable r_y = p_{x_1}ϕ_1(y) / (p_{x_1}ϕ_1(y) + p_{x_2}ϕ_2(y)), we show that the optimal quantizer has a structure of convex cells in the new variable r_y. Based on this convex-cell property of the optimal quantizers, a fast algorithm is proposed to find the globally optimal quantizer in polynomial time.
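
The change of variable in this abstract can be sketched as follows: compute the posterior r_y for a given output y, then quantize by locating r_y among thresholds on [0, 1], so that each quantization cell is an interval (a convex cell) in r_y. The Gaussian densities, the prior, and the threshold value are assumptions chosen for illustration; finding the optimal thresholds is the subject of the paper and is not attempted here.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_r(y, p1=0.5, mu1=-1.0, mu2=1.0, sigma=1.0):
    """r_y = p1*phi_1(y) / (p1*phi_1(y) + p2*phi_2(y)), with p2 = 1 - p1."""
    a = p1 * normal_pdf(y, mu1, sigma)
    b = (1.0 - p1) * normal_pdf(y, mu2, sigma)
    return a / (a + b)

def quantize(y, r_thresholds, **channel):
    """Index of the interval of [0, 1], cut at r_thresholds, that contains r_y."""
    r = posterior_r(y, **channel)
    return sum(r > t for t in r_thresholds)

# One hypothetical threshold on r_y gives two cells.
for y in (-3.0, 0.2, 3.0):
    print(y, round(posterior_r(y), 4), quantize(y, [0.5]))
```

Because the cells are defined on r_y rather than on y, the same code applies when the densities make r_y non-monotone in y, in which case a single interval in r_y may correspond to several disjoint intervals of y.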

preprint, 2020, arXiv

Single-bit Quantization Capacity of Binary-input Continuous-output Channels

We consider a channel with discrete binary input X that is corrupted by a given continuous noise to produce a continuous-valued output Y. A quantizer is then used to quantize the continuous-valued output Y to the final binary output Z. The goal is to design an optimal quantizer Q* and also find the optimal input distribution p*(X) that maximizes the mutual information I(X; Z) between the binary input and the binary quantized output. A linear-time search procedure is proposed. Based on the properties of the optimal quantizer and the optimal input distribution, we reduce the search range, which results in a faster implementation. Both theoretical and numerical results are provided to illustrate our method.

preprint, 2019, arXiv

Minimizing Impurity Partition Under Constraints

Set partitioning is a key component of many algorithms in machine learning, signal processing, and communications. In general, the problem of finding a partition that minimizes a given impurity (loss function) is NP-hard. As such, there exists a wealth of literature on approximate algorithms and theoretical analyses of the partitioning problem under different settings. In this paper, we formulate and solve a variant of the partition problem called the minimum impurity partition under constraint (MIPUC). MIPUC finds an optimal partition that minimizes a given loss function under a given concave constraint. MIPUC generalizes the recently proposed deterministic information bottleneck problem, which finds an optimal partition that maximizes the mutual information between the input and the partition output while minimizing the partition output entropy. Our proposed algorithm is developed based on a novel optimality condition, which allows us to find a locally optimal solution efficiently. Moreover, we show that the optimal partition produces a hard partition that is equivalent to the cuts by hyperplanes in the probability space of the posterior probability, which finally yields a polynomial-time algorithm to find the globally optimal partition. Both theoretical and numerical results are provided to validate the proposed algorithm.

preprint, 2016, arXiv

A Unified Monte-Carlo Jackknife for Small Area Estimation after Model Selection

We consider the estimation of a measure of uncertainty in small area estimation (SAE) when a model selection procedure is involved prior to the estimation. A unified Monte-Carlo jackknife method, called McJack, is proposed for estimating the logarithm of the mean squared prediction error. We prove the second-order unbiasedness of McJack, and demonstrate the performance of McJack in assessing uncertainty in SAE after model selection through empirical investigations that include simulation studies and real-data analyses.
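
For readers unfamiliar with the jackknife idea that McJack builds on, here is a minimal sketch of the classical delete-one jackknife variance estimate. It is not the McJack method itself, which additionally involves Monte-Carlo simulation and model selection; the data values and function names are hypothetical.

```python
def jackknife_variance(data, estimator):
    """Classical delete-one jackknife estimate of the variance of an estimator."""
    n = len(data)
    # Recompute the estimator with each observation left out in turn.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)

def mean(xs):
    return sum(xs) / len(xs)

sample = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical observations
jk_var = jackknife_variance(sample, mean)
print(jk_var)
```

For the sample mean, the jackknife estimate coincides with the usual formula s^2 / n, where s^2 is the unbiased sample variance, which gives a simple check on the implementation.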
