Research connected to &quot;machine learning&quot;

Bimodal Distribution Removal and Genetic Algorithm in Neural Network for Breast Cancer Diagnosis

Diagnosis of breast cancer has been well studied in the past. Multiple linear programming models have been devised to approximate the relationship between cell features and tumour malignancy. However, these models are less capable in handling non-linear correlations. Neural networks instead are powerful in processing complex non-linear correlations. It is thus certainly beneficial to approach this cancer diagnosis problem with a model based on neural network. Particularly, introducing bias to neural network training process is deemed as an important means to increase training efficiency. Out of a number of popular proposed methods for introducing artificial bias, Bimodal Distribution Removal (BDR) presents ideal efficiency improvement results and fair simplicity in implementation. However, this paper examines the effectiveness of BDR against the target cancer diagnosis classification problem and shows that BDR process in fact negatively impacts classification performance. In addition, this paper also explores genetic algorithm as an efficient tool for feature selection and produced significantly better results comparing to baseline model that without any feature selection in place

Fuzzy clustering algorithms with distance metric learning and entropy regularization

The clustering methods have been used in a variety of fields such as image processing, data mining, pattern recognition, and statistical analysis. Generally, the clustering algorithms consider all variables equally relevant or not correlated for the clustering task. Nevertheless, in real situations, some variables can be correlated or may be more or less relevant or even irrelevant for this task. This paper proposes partitioning fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization. These methods are an iterative three steps algorithms which provide a fuzzy partition, a representative for each fuzzy cluster, and the relevance weight of the variables or their correlation by minimizing a suitable objective function. Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.

Self-Activating Neural Ensembles for Continual Reinforcement Learning

The ability for an agent to continuously learn new skills without catastrophically forgetting existing knowledge is of critical importance for the development of generally intelligent agents. Most methods devised to address this problem depend heavily on well-defined task boundaries, and thus depend on human supervision. Our task-agnostic method, Self-Activating Neural Ensembles (SANE), uses a modular architecture designed to avoid catastrophic forgetting without making any such assumptions. At the beginning of each trajectory, a module in the SANE ensemble is activated to determine the agent's next policy. During training, new modules are created as needed and only activated modules are updated to ensure that unused modules remain unchanged. This system enables our method to retain and leverage old skills, while growing and learning new ones. We demonstrate our approach on visually rich procedurally generated environments.

preprint2016arXiv

ROCS-Derived Features for Virtual Screening

Rapid overlay of chemical structures (ROCS) is a standard tool for the calculation of 3D shape and chemical ("color") similarity. ROCS uses unweighted sums to combine many aspects of similarity, yielding parameter-free models for virtual screening. In this report, we decompose the ROCS color force field into "color components" and "color atom overlaps", novel color similarity features that can be weighted in a system-specific manner by machine learning algorithms. In cross-validation experiments, these additional features significantly improve virtual screening performance (ROC AUC scores) relative to standard ROCS.

preprint2011arXiv

Iterative Reweighted Algorithms for Sparse Signal Recovery with Temporally Correlated Source Vectors

Iterative reweighted algorithms, as a class of algorithms for sparse signal recovery, have been found to have better performance than their non-reweighted counterparts. However, for solving the problem of multiple measurement vectors (MMVs), all the existing reweighted algorithms do not account for temporal correlation among source vectors and thus their performance degrades significantly in the presence of correlation. In this work we propose an iterative reweighted sparse Bayesian learning (SBL) algorithm exploiting the temporal correlation, and motivated by it, we propose a strategy to improve existing reweighted $\ell_2$ algorithms for the MMV problem, i.e. replacing their row norms with Mahalanobis distance measure. Simulations show that the proposed reweighted SBL algorithm has superior performance, and the proposed improvement strategy is effective for existing reweighted $\ell_2$ algorithms.

Revisiting the General Identifiability Problem

We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add positivity assumption of observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems.

preprint2017arXiv

Iterative Random Forests to detect predictive and stable high-order interactions

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

Robustness to Augmentations as a Generalization metric

Generalization is the ability of a model to predict on unseen domains and is a fundamental task in machine learning. Several generalization bounds, both theoretical and empirical have been proposed but they do not provide tight bounds .In this work, we propose a simple yet effective method to predict the generalization performance of a model by using the concept that models that are robust to augmentations are more generalizable than those which are not. We experiment with several augmentations and composition of augmentations to check the generalization capacity of a model. We also provide a detailed motivation behind the proposed method. The proposed generalization metric is calculated based on the change in the output of the model after augmenting the input. The proposed method was the first runner up solution for the NeurIPS competition on Predicting Generalization in Deep Learning.

Continual Normalization: Rethinking Batch Normalization for Online Continual Learning

Existing continual learning methods use Batch Normalization (BN) to facilitate training and improve generalization across tasks. However, the non-i.i.d and non-stationary nature of continual learning data, especially in the online setting, amplify the discrepancy between training and testing in BN and hinder the performance of older tasks. In this work, we study the cross-task normalization effect of BN in online continual learning where BN normalizes the testing data using moments biased towards the current task, resulting in higher catastrophic forgetting. This limitation motivates us to propose a simple yet effective method that we call Continual Normalization (CN) to facilitate training similar to BN while mitigating its negative effect. Extensive experiments on different continual learning algorithms and online scenarios show that CN is a direct replacement for BN and can provide substantial performance improvements. Our implementation is available at \url{https://github.com/phquang/Continual-Normalization}.

FOOD: Fast Out-Of-Distribution Detector

Deep neural networks (DNNs) perform well at classifying inputs associated with the classes they have been trained on, which are known as in distribution inputs. However, out-of-distribution (OOD) inputs pose a great challenge to DNNs and consequently represent a major risk when DNNs are implemented in safety-critical systems. Extensive research has been performed in the domain of OOD detection. However, current state-of-the-art methods for OOD detection suffer from at least one of the following limitations: (1) increased inference time - this limits existing methods' applicability to many real-world applications, and (2) the need for OOD training data - such data can be difficult to acquire and may not be representative enough, thus limiting the ability of the OOD detector to generalize. In this paper, we propose FOOD -- Fast Out-Of-Distribution detector -- an extended DNN classifier capable of efficiently detecting OOD samples with minimal inference time overhead. Our architecture features a DNN with a final Gaussian layer combined with the log likelihood ratio statistical test and an additional output neuron for OOD detection. Instead of using real OOD data, we use a novel method to craft artificial OOD samples from in-distribution data, which are used to train our OOD detector neuron. We evaluate FOOD's detection performance on the SVHN, CIFAR-10, and CIFAR-100 datasets. Our results demonstrate that in addition to achieving state-of-the-art performance, FOOD is fast and applicable to real-world applications.

Unbiased Asymmetric Reinforcement Learning under Partial Observability

In partially observable reinforcement learning, offline training gives access to latent information which is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack theoretical foundation, and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant, and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound, maintaining the validity of the policy gradient theorem, and introducing no bias and relatively low variance into the training process. An empirical evaluation performed on domains which exhibit significant partial observability confirms our analysis, demonstrating that unbiased asymmetric actor-critic converges to better policies and/or faster than symmetric and biased asymmetric baselines.

preprint2012arXiv

Inferring deterministic causal relations

We consider two variables that are related to each other by an invertible function. While it has previously been shown that the dependence structure of the noise can provide hints to determine which of the two variables is the cause, we presently show that even in the deterministic (noise-free) case, there are asymmetries that can be exploited for causal inference. Our method is based on the idea that if the function and the probability density of the cause are chosen independently, then the distribution of the effect will, in a certain sense, depend on the function. We provide a theoretical analysis of this method, showing that it also works in the low noise regime, and link it to information geometry. We report strong empirical results on various real-world data sets from different domains.

Applying Cyclical Learning Rate to Neural Machine Translation

In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though it is crucial in ensuring a fast convergence to a good quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policy for computer vision related convolutional networks and datasets, we explore how cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizers and the associated cyclical learning rate policy can have a significant impact on the performance. In addition, we establish guidelines when applying cyclical learning rates to neural machine translation tasks. Thus with our work, we hope to raise awareness of the importance of selecting the right optimizers and the accompanying learning rate policy, at the same time, encourage further research into easy-to-use learning rate policies.

preprint2026arXiv

Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.

preprint2014arXiv

Learning Boolean Halfspaces with Small Weights from Membership Queries

We consider the problem of proper learning a Boolean Halfspace with integer weights $\{0,1,\ldots,t\}$ from membership queries only. The best known algorithm for this problem is an adaptive algorithm that asks $n^{O(t^5)}$ membership queries where the best lower bound for the number of membership queries is $n^t$ [Learning Threshold Functions with Small Weights Using Membership Queries. COLT 1999] In this paper we close this gap and give an adaptive proper learning algorithm with two rounds that asks $n^{O(t)}$ membership queries. We also give a non-adaptive proper learning algorithm that asks $n^{O(t^3)}$ membership queries.

Learning to Liquidate Forex: Optimal Stopping via Adaptive Top-K Regression

We consider learning a trading agent acting on behalf of the treasury of a firm earning revenue in a foreign currency (FC) and incurring expenses in the home currency (HC). The goal of the agent is to maximize the expected HC at the end of the trading episode by deciding to hold or sell the FC at each time step in the trading episode. We pose this as an optimization problem, and consider a broad spectrum of approaches with the learning component ranging from supervised to imitation to reinforcement learning. We observe that most of the approaches considered struggle to improve upon simple heuristic baselines. We identify two key aspects of the problem that render standard solutions ineffective - i) while good forecasts of future FX rates can be highly effective in guiding good decisions, forecasting FX rates is difficult, and erroneous estimates tend to degrade the performance of trading agents instead of improving it, ii) the inherent non-stationary nature of FX rates renders a fixed decision-threshold highly ineffective. To address these problems, we propose a novel supervised learning approach that learns to forecast the top-K future FX rates instead of forecasting all the future FX rates, and bases the hold-versus-sell decision on the forecasts (e.g. hold if future FX rate is higher than current FX rate, sell otherwise). Furthermore, to handle the non-stationarity in the FX rates data which poses challenges to the i.i.d. assumption in supervised learning methods, we propose to adaptively learn decision-thresholds based on recent historical episodes. Through extensive empirical evaluation, we show that our approach is the only approach which is able to consistently improve upon a simple heuristic baseline. Further experiments show the inefficacy of state-of-the-art statistical and deep-learning-based forecasting methods as they degrade the performance of the trading agent.

Model-Centric and Data-Centric Aspects of Active Learning for Deep Neural Networks

We study different aspects of active learning with deep neural networks in a consistent and unified way. i) We investigate incremental and cumulative training modes which specify how the newly labeled data are used for training. ii) We study active learning w.r.t. the model configurations such as the number of epochs and neurons as well as the choice of batch size. iii) We consider in detail the behavior of query strategies and their corresponding informativeness measures and accordingly propose more efficient querying procedures. iv) We perform statistical analyses, e.g., on actively learned classes and test error estimation, that reveal several insights about active learning. v) We investigate how active learning with neural networks can benefit from pseudo-labels as proxies for actual labels.

Analysis via Orthonormal Systems in Reproducing Kernel Hilbert $C^*$-Modules and Applications

Kernel methods have been among the most popular techniques in machine learning, where learning tasks are solved using the property of reproducing kernel Hilbert space (RKHS). In this paper, we propose a novel data analysis framework with reproducing kernel Hilbert $C^*$-module (RKHM), which is another generalization of RKHS than vector-valued RKHS (vv-RKHS). Analysis with RKHMs enables us to deal with structures among variables more explicitly than vv-RKHS. We show the theoretical validity for the construction of orthonormal systems in Hilbert $C^*$-modules, and derive concrete procedures for orthonormalization in RKHMs with those theoretical properties in numerical computations. Moreover, we apply those to generalize with RKHM kernel principal component analysis and the analysis of dynamical systems with Perron-Frobenius operators. The empirical performance of our methods is also investigated by using synthetic and real-world data.

Graph Neural Networks to Predict Customer Satisfaction Following Interactions with a Corporate Call Center

Customer satisfaction is an important factor in creating and maintaining long-term relationships with customers. Near real-time identification of potentially dissatisfied customers following phone calls can provide organizations the opportunity to take meaningful interventions and to foster ongoing customer satisfaction and loyalty. This work describes a fully operational system we have developed at a large US company for predicting customer satisfaction following incoming phone calls. The system takes as an input speech-to-text transcriptions of calls and predicts call satisfaction reported by customers on post-call surveys (scale from 1 to 10). Because of its ordinal, subjective, and often highly-skewed nature, predicting survey scores is not a trivial task and presents several modeling challenges. We introduce a graph neural network (GNN) approach that takes into account the comparative nature of the problem by considering the relative scores among batches, instead of only pairs of calls when training. This approach produces more accurate predictions than previous approaches including standard regression and classification models that directly fit the survey scores with call data. Our proposed approach can be easily generalized to other customer satisfaction prediction problems.

Adversaries in Online Learning Revisited: with applications in Robust Optimization and Adversarial training

We revisit the concept of "adversary" in online learning, motivated by solving robust optimization and adversarial training using online learning methods. While one of the classical setups in online learning deals with the "adversarial" setup, it appears that this concept is used less rigorously, causing confusion in applying results and insights from online learning. Specifically, there are two fundamentally different types of adversaries, depending on whether the "adversary" is able to anticipate the exogenous randomness of the online learning algorithms. This is particularly relevant to robust optimization and adversarial training because the adversarial sequences are often anticipative, and many online learning algorithms do not achieve diminishing regret in such a case. We then apply this to solving robust optimization problems or (equivalently) adversarial training problems via online learning and establish a general approach for a large variety of problem classes using imaginary play. Here two players play against each other, the primal player playing the decisions and the dual player playing realizations of uncertain data. When the game terminates, the primal player has obtained an approximately robust solution. This meta-game allows for solving a large variety of robust optimization and multi-objective optimization problems and generalizes the approach of arXiv:1402.6361.

Zero-Bias Deep Learning for Accurate Identification of Internet of Things (IoT) Devices

The Internet of Things (IoT) provides applications and services that would otherwise not be possible. However, the open nature of IoT make it vulnerable to cybersecurity threats. Especially, identity spoofing attacks, where an adversary passively listens to existing radio communications and then mimic the identity of legitimate devices to conduct malicious activities. Existing solutions employ cryptographic signatures to verify the trustworthiness of received information. In prevalent IoT, secret keys for cryptography can potentially be disclosed and disable the verification mechanism. Non-cryptographic device verification is needed to ensure trustworthy IoT. In this paper, we propose an enhanced deep learning framework for IoT device identification using physical layer signals. Specifically, we enable our framework to report unseen IoT devices and introduce the zero-bias layer to deep neural networks to increase robustness and interpretability. We have evaluated the effectiveness of the proposed framework using real data from ADS-B (Automatic Dependent Surveillance-Broadcast), an application of IoT in aviation. The proposed framework has the potential to be applied to accurate ide

Graph-Aided Online Multi-Kernel Learning

Multi-kernel learning (MKL) has been widely used in function approximation tasks. The key problem of MKL is to combine kernels in a prescribed dictionary. Inclusion of irrelevant kernels in the dictionary can deteriorate accuracy of MKL, and increase the computational complexity. To improve the accuracy of function approximation and reduce the computational complexity, the present paper studies data-driven selection of kernels from the dictionary that provide satisfactory function approximations. Specifically, based on the similarities among kernels, the novel framework constructs and refines a graph to assist choosing a subset of kernels. In addition, random feature approximation is utilized to enable online implementation for sequentially obtained data. Theoretical analysis shows that our proposed algorithms enjoy tighter sub-linear regret bound compared with state-of-art graph-based online MKL alternatives. Experiments on a number of real datasets also showcase the advantages of our novel graph-aided framework.

Positive Feature Values Prioritized Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes Classifier for Hierarchical Feature Spaces

The Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes (HRE-TAN) classifier is a semi-naive Bayesian model that learns a type of hierarchical redundancy-free tree-like feature representation to estimate the data distribution. In this work, we propose two new types of positive feature values prioritized hierarchical redundancy eliminated tree augmented naive Bayes classifiers that focus on features bearing positive instance values. The two newly proposed methods are applied to 28 real-world bioinformatics datasets showing better predictive performance than the conventional HRE-TAN classifier.

preprint2011arXiv

Distribution-Independent Evolvability of Linear Threshold Functions

Valiant's (2007) model of evolvability models the evolutionary process of acquiring useful functionality as a restricted form of learning from random examples. Linear threshold functions and their various subclasses, such as conjunctions and decision lists, play a fundamental role in learning theory and hence their evolvability has been the primary focus of research on Valiant's framework (2007). One of the main open problems regarding the model is whether conjunctions are evolvable distribution-independently (Feldman and Valiant, 2008). We show that the answer is negative. Our proof is based on a new combinatorial parameter of a concept class that lower-bounds the complexity of learning from correlations. We contrast the lower bound with a proof that linear threshold functions having a non-negligible margin on the data points are evolvable distribution-independently via a simple mutation algorithm. Our algorithm relies on a non-linear loss function being used to select the hypotheses instead of 0-1 loss in Valiant's (2007) original definition. The proof of evolvability requires that the loss function satisfies several mild conditions that are, for example, satisfied by

Bayesian Learning Approach to Model Predictive Control

This study presents a Bayesian learning perspective towards model predictive control algorithms. High-level frameworks have been developed separately in the earlier studies on Bayesian learning and sampling-based model predictive control. On one hand, the Bayesian learning rule provides a general framework capable of generating various machine learning algorithms as special instances. On the other hand, the dynamic mirror descent model predictive control framework is capable of diversifying sample-rollout-based control algorithms. However, connections between the two frameworks have still not been fully appreciated in the context of stochastic optimal control. This study combines the Bayesian learning rule point of view into the model predictive control setting by taking inspirations from the view of understanding model predictive controller as an online learner. The selection of posterior class and natural gradient approximation for the variational formulation governs diversification of model predictive control algorithms in the Bayesian learning approach to model predictive control. This alternative viewpoint complements the dynamic mirror descent framework through streamlining the explanation of design choices.

Make Hawkes Processes Explainable by Decomposing Self-Triggering Kernels

Hawkes Processes capture self-excitation and mutual-excitation between events when the arrival of an event makes future events more likely to happen. Identification of such temporal covariance can reveal the underlying structure to better predict future events. In this paper, we present a new framework to decompose discrete events with a composition of multiple self-triggering kernels. The composition scheme allows us to decompose empirical covariance densities into the sum or the product of base kernels which are easily interpretable. Here, we present the first multiplicative kernel composition methods for Hawkes Processes. We demonstrate that the new automatic kernel decomposition procedure outperforms the existing methods on the prediction of discrete events in real-world data.

Volume-Centred Range Bars: Novel Interpretable Representation of Financial Markets Designed for Machine Learning Applications

Financial markets are a source of non-stationary multidimensional time series which has been drawing attention for decades. Each financial instrument has its specific changing-over-time properties, making its analysis a complex task. Hence, improvement of understanding and development of more informative, generalisable market representations are essential for the successful operation in financial markets, including risk assessment, diversification, trading, and order execution. In this study, we propose a volume-price-based market representation for making financial time series more suitable for machine learning pipelines. We use a statistical approach for evaluating the representation. Through the research questions, we investigate, i) whether the proposed representation allows the more efficient design of machine learning models; ii) whether the proposed representation leads to increased performance over the price levels market pattern; iii) whether the proposed representation performs better on the liquid markets, and iv) whether SHAP feature interactions are reliable to be used in the considered setting. Our analysis shows that the proposed volume-based method allows successful classification of the financial time series patterns, and also leads to better classification performance than the price levels-based method, excelling specifically on more liquid financial instruments. Finally, we propose an approach for obtaining feature interactions directly from tree-based models and compare the outcomes to those of the SHAP method. This results in the significant similarity between the two methods, hence we claim that SHAP feature interactions are reliable to be used in the setting of financial markets.

Optimistic Optimization of Gaussian Process Samples

Bayesian optimization is a popular formalism for global optimization, but its computational costs limit it to expensive-to-evaluate functions. A competing, computationally more efficient, global optimization framework is optimistic optimization, which exploits prior knowledge about the geometry of the search space in form of a dissimilarity function. We investigate to which degree the conceptual advantages of Bayesian Optimization can be combined with the computational efficiency of optimistic optimization. By mapping the kernel to a dissimilarity, we obtain an optimistic optimization algorithm for the Bayesian Optimization setting with a run-time of up to $\mathcal{O}(N \log N)$. As a high-level take-away we find that, when using stationary kernels on objectives of relatively low evaluation cost, optimistic optimization can be strongly preferable over Bayesian optimization, while for strongly coupled and parametric models, good implementations of Bayesian optimization can perform much better, even at low evaluation cost. We argue that there is a new research domain between geometric and probabilistic search, i.e. methods that run drastically faster than traditional Bayesian optimization, while retaining some of the crucial functionality of Bayesian optimization.

preprint2026arXiv

Proximal Projection for Doubly Sparse Regularized Models

Regularization is often used in high-dimensional regression settings to generate a sparse model, which can save tremendous computing resources and identify predictors that are most strongly associated with the response. When the predictors can be represented by a Gaussian graphical model, the structure of the predictor graph can be exploited during regularization. Our proposed model exploits this underlying predictor graph structure by decomposing the estimated coefficient vector into a sum of latent variables that correspond to the sum of each node contribution to the coefficient vector. Regularization is then performed on the latent variables rather than on the coefficient vector directly. We use a penalty function that permits a clear user-defined trade-off between the L1 and L2 penalties and propose a novel proximal projection during optimization. Further, our implementation computes the projection operator for the intersection of selected groups, which conserves more computing resources compared to predictor duplication methods, especially for high-dimensional data. Through simulation, we evaluate the performance of our approach under different graph structures and node counts, and present results on real-world data. Results suggest that our method exhibits stable performance relative to other singly or doubly sparse graphical regression models.

Fundamental limits for rank-one matrix estimation with groupwise heteroskedasticity

Low-rank matrix recovery problems involving high-dimensional and heterogeneous data appear in applications throughout statistics and machine learning. The contribution of this paper is to establish the fundamental limits of recovery for a broad class of these problems. In particular, we study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and underlying factors. These results are based on a novel reduction from the low-rank matrix tensor product model (with homogeneous noise) to a rank-one model with heteroskedastic noise. As an application of our main result, we show that recently proposed methods based on applying principal component analysis (PCA) to weighted combinations of the data are optimal in some settings but sub-optimal in others. We also provide numerical results comparing our asymptotic formulas with the performance of methods based on weighted PCA, gradient descent, and approximate message passing.