Research connected to "machine learning"

Search papers, authors, topics, institutions and opportunities, then move straight into the graph around the result.

FiltersOptional

Search results

Showing works 2,017-2,048 from 49,008 works in Machine Learning. Use pages to browse more, or open the graph for the map.

49,008matching works
Full topic scaleMachine Learning

49,008 works and 109,744 authors are indexed for this topic. This page shows 32 works at a time so search stays fast.

Match modeExact match focus
Semantic hits0
Active filters0
Graph viewOpen

Papers

preprint2016arXiv

High Dimensional Multivariate Regression and Precision Matrix Estimation via Nonconvex Optimization

We propose a nonconvex estimator for joint multivariate regression and precision matrix estimation in the high dimensional regime, under sparsity constraints. A gradient descent algorithm with hard thresholding is developed to solve the nonconvex estimator, and it attains a linear rate of convergence to the true regression coefficients and precision matrix simultaneously, up to the statistical error. Compared with existing methods along this line of research, which have little theoretical guarantee, the proposed algorithm not only is computationally much more efficient with provable convergence guarantee, but also attains the optimal finite sample statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real datasets back up our theory.

preprint2022arXiv

TimeREISE: Time-series Randomized Evolving Input Sample Explanation

Deep neural networks are one of the most successful classifiers across different domains. However, due to their limitations concerning interpretability their use is limited in safety critical context. The research field of explainable artificial intelligence addresses this problem. However, most of the interpretability methods are aligned to the image modality by design. The paper introduces TimeREISE a model agnostic attribution method specifically aligned to success in the context of time series classification. The method shows superior performance compared to existing approaches concerning different well-established measurements. TimeREISE is applicable to any time series classification network, its runtime does not scale in a linear manner concerning the input shape and it does not rely on prior data knowledge.

preprint2026arXiv

It Just Takes Two: Scaling Amortized Inference to Large Sets

Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size -- a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.

preprint2015arXiv

Asymptotic Model Selection for Directed Networks with Hidden Variables

We extend the Bayesian Information Criterion (BIC), an asymptotic approximation for the marginal likelihood, to Bayesian networks with hidden variables. This approximation can be used to select models given large samples of data. The standard BIC as well as our extension punishes the complexity of a model according to the dimension of its parameters. We argue that the dimension of a Bayesian network with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. We compute the dimensions of several networks including the naive Bayes model with a hidden root node.

preprint2022arXiv

Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations

Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare by assisting decision-makers to answer ''what-iF'' questions. Existing causal inference approaches typically consider regular, discrete-time intervals between observations and treatment decisions and hence are unable to naturally model irregularly sampled data, which is the common setting in practice. To handle arbitrary observation patterns, we interpret the data as samples from an underlying continuous-time process and propose to model its latent trajectory explicitly using the mathematics of controlled differential equations. This leads to a new approach, the Treatment Effect Neural Controlled Differential Equation (TE-CDE), that allows the potential outcomes to be evaluated at any time point. In addition, adversarial training is used to adjust for time-dependent confounding which is critical in longitudinal settings and is an added challenge not encountered in conventional time-series. To assess solutions to this problem, we propose a controllable simulation environment based on a model of tumor growth for a range of scenarios with irregular sampling reflective of a variety of clinical scenarios. TE-CDE consistently outperforms existing approaches in all simulated scenarios with irregular sampling.

preprint2026arXiv

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. We localize the Gemma-3-27B handoff to five attention heads through two-stage path patching that recover ~90% of the rhyme-routing capacity at the newline.

preprint2022arXiv

Rethinking Saliency Map: An Context-aware Perturbation Method to Explain EEG-based Deep Learning Model

Deep learning is widely used to decode the electroencephalogram (EEG) signal. However, there are few attempts to specifically investigate how to explain the EEG-based deep learning models. We conduct a review to summarize the existing works explaining the EEG-based deep learning model. Unfortunately, we find that there is no appropriate method to explain them. Based on the characteristic of EEG data, we suggest a context-aware perturbation method to generate a saliency map from the perspective of the raw EEG signal. Moreover, we also justify that the context information can be used to suppress the artifacts in the EEG-based deep learning model. In practice, some users might want a simple version of the explanation, which only indicates a few features as salient points. To this end, we propose an optional area limitation strategy to restrict the highlighted region. To validate our idea and make a comparison with the other methods, we select three representative EEG-based models to implement experiments on the emotional EEG dataset DEAP. The results of the experiments support the advantages of our method.

preprint2012arXiv

Optimal Weighting of Multi-View Data with Low Dimensional Hidden States

In Natural Language Processing (NLP) tasks, data often has the following two properties: First, data can be chopped into multi-views which has been successfully used for dimension reduction purposes. For example, in topic classification, every paper can be chopped into the title, the main text and the references. However, it is common that some of the views are less noisier than other views for supervised learning problems. Second, unlabeled data are easy to obtain while labeled data are relatively rare. For example, articles occurred on New York Times in recent 10 years are easy to grab but having them classified as 'Politics', 'Finance' or 'Sports' need human labor. Hence less noisy features are preferred before running supervised learning methods. In this paper we propose an unsupervised algorithm which optimally weights features from different views when these views are generated from a low dimensional hidden state, which occurs in widely used models like Mixture Gaussian Model, Hidden Markov Model (HMM) and Latent Dirichlet Allocation (LDA).

preprint2022arXiv

Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time

We propose an input sparsity time sampling algorithm that can spectrally approximate the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special case of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the input dataset and has no dependence on $q$. Previous techniques either incur poly$(q)$ slowdowns in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We also show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.

preprint2014arXiv

Riffled Independence for Efficient Inference with Partial Rankings

Distributions over rankings are used to model data in a multitude of real world settings such as preference analysis and political elections. Modeling such distributions presents several computational challenges, however, due to the factorial size of the set of rankings over an item set. Some of these challenges are quite familiar to the artificial intelligence community, such as how to compactly represent a distribution over a combinatorially large space, and how to efficiently perform probabilistic inference with these representations. With respect to ranking, however, there is the additional challenge of what we refer to as human task complexity users are rarely willing to provide a full ranking over a long list of candidates, instead often preferring to provide partial ranking information. Simultaneously addressing all of these challenges i.e., designing a compactly representable model which is amenable to efficient inference and can be learned using partial ranking data is a difficult task, but is necessary if we would like to scale to problems with nontrivial size. In this paper, we show that the recently proposed riffled independence assumptions cleanly and efficiently addre

preprint2021arXiv

Robust Single-step Adversarial Training with Regularizer

High cost of training time caused by multi-step adversarial example generation is a major challenge in adversarial training. Previous methods try to reduce the computational burden of adversarial training using single-step adversarial example generation schemes, which can effectively improve the efficiency but also introduce the problem of catastrophic overfitting, where the robust accuracy against Fast Gradient Sign Method (FGSM) can achieve nearby 100\% whereas the robust accuracy against Projected Gradient Descent (PGD) suddenly drops to 0\% over a single epoch. To address this problem, we propose a novel Fast Gradient Sign Method with PGD Regularization (FGSMPR) to boost the efficiency of adversarial training without catastrophic overfitting. Our core idea is that single-step adversarial training can not learn robust internal representations of FGSM and PGD adversarial examples. Therefore, we design a PGD regularization term to encourage similar embeddings of FGSM and PGD adversarial examples. The experiments demonstrate that our proposed method can train a robust deep network for L$_\infty$-perturbations with FGSM adversarial training and reduce the gap to multi-step adversarial training.

preprint2021arXiv

Modeling Household Online Shopping Demand in the U.S.: A Machine Learning Approach and Comparative Investigation between 2009 and 2017

Despite the rapid growth of online shopping and research interest in the relationship between online and in-store shopping, national-level modeling and investigation of the demand for online shopping with a prediction focus remain limited in the literature. This paper differs from prior work and leverages two recent releases of the U.S. National Household Travel Survey (NHTS) data for 2009 and 2017 to develop machine learning (ML) models, specifically gradient boosting machine (GBM), for predicting household-level online shopping purchases. The NHTS data allow for not only conducting nationwide investigation but also at the level of households, which is more appropriate than at the individual level given the connected consumption and shopping needs of members in a household. We follow a systematic procedure for model development including employing Recursive Feature Elimination algorithm to select input variables (features) in order to reduce the risk of model overfitting and increase model explainability. Extensive post-modeling investigation is conducted in a comparative manner between 2009 and 2017, including quantifying the importance of each input variable in predicting online shopping demand, and characterizing value-dependent relationships between demand and the input variables. In doing so, two latest advances in machine learning techniques, namely Shapley value-based feature importance and Accumulated Local Effects plots, are adopted to overcome inherent drawbacks of the popular techniques in current ML modeling. The modeling and investigation are performed both at the national level and for three of the largest cities (New York, Los Angeles, and Houston). The models developed and insights gained can be used for online shopping-related freight demand generation and may also be considered for evaluating the potential impact of relevant policies on online shopping demand.

preprint2011arXiv

Decentralized Online Learning Algorithms for Opportunistic Spectrum Access

The fundamental problem of multiple secondary users contending for opportunistic spectrum access over multiple channels in cognitive radio networks has been formulated recently as a decentralized multi-armed bandit (D-MAB) problem. In a D-MAB problem there are $M$ users and $N$ arms (channels) that each offer i.i.d. stochastic rewards with unknown means so long as they are accessed without collision. The goal is to design a decentralized online learning policy that incurs minimal regret, defined as the difference between the total expected rewards accumulated by a model-aware genie, and that obtained by all users applying the policy. We make two contributions in this paper. First, we consider the setting where the users have a prioritized ranking, such that it is desired for the $K$-th-ranked user to learn to access the arm offering the $K$-th highest mean reward. For this problem, we present the first distributed policy that yields regret that is uniformly logarithmic over time without requiring any prior assumption about the mean rewards. Second, we consider the case when a fair access policy is required, i.e., it is desired for all users to experience the same mean reward. For t

preprint2026arXiv

Spectral Graph Sparsification Preserves Representation Geometry in Graph Neural Networks

Spectral graph sparsification is a classical tool for reducing graph complexity while preserving Laplacian quadratic forms. In graph neural networks (GNNs), sparsification is often used to accelerate computation while maintaining predictive performance. In this work, we study a complementary representation-level question: does sparsification preserve the geometry of learned embeddings? For polynomial-filter GNNs, we prove that any $ε$-spectral sparsifier induces $O(ε)$ perturbations in polynomial graph filters, multilayer hidden representations, and their Gram matrices. These guarantees imply stability of squared pairwise distances, class means, and covariance structure in embedding space. We further establish finite-time training stability: under smoothness and boundedness assumptions, gradient descent on dense and sparsified graphs produces weight trajectories whose separation grows at most proportionally to the sparsification distortion. Empirically, effective-resistance sparsification validates the predicted perturbation chain on synthetic graphs and preserves hidden representation geometry on real datasets. In our experiments, the gram matrix and training dynamics show low divergence even under substantial sparsification, consistent with the predicted stability under spectral sparsification. Hidden Gram preservation strongly predicts neighborhood preservation and class-centroid stability across FashionMNIST, Cora, and Paul15. Together, these results show that spectral sparsification preserves not only graph operators, but also the representation geometry that supports downstream use of GNN embeddings for interpretability.

preprint2020arXiv

The Sample Complexity of Meta Sparse Regression

This paper addresses the meta-learning problem in sparse linear regression with infinite tasks. We assume that the learner can access several similar tasks. The goal of the learner is to transfer knowledge from the prior tasks to a similar but novel task. For p parameters, size of the support set k , and l samples per task, we show that T \in O (( k log(p) ) /l ) tasks are sufficient in order to recover the common support of all tasks. With the recovered support, we can greatly reduce the sample complexity for estimating the parameter of the novel task, i.e., l \in O (1) with respect to T and p . We also prove that our rates are minimax optimal. A key difference between meta-learning and the classical multi-task learning, is that meta-learning focuses only on the recovery of the parameters of the novel task, while multi-task learning estimates the parameter of all tasks, which requires l to grow with T . Instead, our efficient meta-learning estimator allows for l to be constant with respect to T (i.e., few-shot learning).

preprint2026arXiv

FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning

Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapt

preprint2020arXiv

From deep to Shallow: Equivalent Forms of Deep Networks in Reproducing Kernel Krein Space and Indefinite Support Vector Machines

In this paper we explore a connection between deep networks and learning in reproducing kernel Krein space. Our approach is based on the concept of push-forward - that is, taking a fixed non-linear transform on a linear projection and converting it to a linear projection on the output of a fixed non-linear transform, pushing the weights forward through the non-linearity. Applying this repeatedly from the input to the output of a deep network, the weights can be progressively "pushed" to the output layer, resulting in a flat network that has the form of a fixed non-linear map (whose form is determined by the structure of the deep network) followed by a linear projection determined by the weight matrices - that is, we take a deep network and convert it to an equivalent (indefinite) kernel machine. We then investigate the implications of this transformation for capacity control and uniform convergence, and provide a Rademacher complexity bound on the deep network in terms of Rademacher complexity in reproducing kernel Krein space. Finally, we analyse the sparsity properties of the flat representation, showing that the flat weights are (effectively) Lp-"norm" regularise

preprint2020arXiv

Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems

Inferring behavior model of a running software system is quite useful for several automated software engineering tasks, such as program comprehension, anomaly detection, and testing. Most existing dynamic model inference techniques are white-box, i.e., they require source code to be instrumented to get run-time traces. However, in many systems, instrumenting the entire source code is not possible (e.g., when using black-box third-party libraries) or might be very costly. Unfortunately, most black-box techniques that detect states over time are either univariate, or make assumptions on the data distribution, or have limited power for learning over a long period of past behavior. To overcome the above issues, in this paper, we propose a hybrid deep neural network that accepts as input a set of time series, one per input/output signal of the system, and applies a set of convolutional and recurrent layers to learn the non-linear correlations between signals and the patterns, over time. We have applied our approach on a real UAV auto-pilot solution from our industry partner with half a million lines of C code. We ran 888 random recent system-level test cases and inferred states, over ti

preprint2020arXiv

Encoding-based Memory Modules for Recurrent Neural Networks

Learning to solve sequential tasks with recurrent models requires the ability to memorize long sequences and to extract task-relevant features from them. In this paper, we study the memorization subtask from the point of view of the design and training of recurrent neural networks. We propose a new model, the Linear Memory Network, which features an encoding-based memorization component built with a linear autoencoder for sequences. We extend the memorization component with a modular memory that encodes the hidden state sequence at different sampling frequencies. Additionally, we provide a specialized training algorithm that initializes the memory to efficiently encode the hidden activations of the network. The experimental results on synthetic and real-world datasets show that specializing the training algorithm to train the memorization component always improves the final performance whenever the memorization of long sequences is necessary to solve the problem.

preprint2024arXiv

A Theory of the Risk for Optimization with Relaxation and its Application to Support Vector Machines

In this paper we consider optimization with relaxation, an ample paradigm to make data-driven designs. This approach was previously considered by the same authors of this work in Garatti and Campi (2019), a study that revealed a deep-seated connection between two concepts: risk (probability of not satisfying a new, out-of-sample, constraint) and complexity (according to a definition introduced in paper Garatti and Campi (2019)). This connection was shown to have profound implications in applications because it implied that the risk can be estimated from the complexity, a quantity that can be measured from the data without any knowledge of the data-generation mechanism. In the present work we establish new results. First, we expand the scope of Garatti and Campi (2019) so as to embrace a more general setup that covers various algorithms in machine learning. Then, we study classical support vector methods - including SVM (Support Vector Machine), SVR (Support Vector Regression) and SVDD (Support Vector Data Description) - and derive new results for the ability of these methods to generalize. All results are valid for any finite size of the data set. When the sample size tends to infinity, we establish the unprecedented result that the risk approaches the ratio between the complexity and the cardinality of the data sample, regardless of the value of the complexity.

preprint2022arXiv

Multimodality in Meta-Learning: A Comprehensive Survey

Meta-learning has gained wide popularity as a training framework that is more data-efficient than traditional machine learning methods. However, its generalization ability in complex task distributions, such as multimodal tasks, has not been thoroughly studied. Recently, some studies on multimodality-based meta-learning have emerged. This survey provides a comprehensive overview of the multimodality-based meta-learning landscape in terms of the methodologies and applications. We first formalize the definition of meta-learning in multimodality, along with the research challenges in this growing field, such as how to enrich the input in few-shot learning (FSL) or zero-shot learning (ZSL) in multimodal scenarios and how to generalize the models to new tasks. We then propose a new taxonomy to discuss typical meta-learning algorithms in multimodal tasks systematically. We investigate the contributions of related papers and summarize them by our taxonomy. Finally, we propose potential research directions for this promising field.

preprint2026arXiv

Direction-Preserving Number Representations

Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two's complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA's NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA's choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.

preprint2020arXiv

ExTra: Transfer-guided Exploration

In this work we present a novel approach for transfer-guided exploration in reinforcement learning that is inspired by the human tendency to leverage experiences from similar encounters in the past while navigating a new task. Given an optimal policy in a related task-environment, we show that its bisimulation distance from the current task-environment gives a lower bound on the optimal advantage of state-action pairs in the current task-environment. Transfer-guided Exploration (ExTra) samples actions from a Softmax distribution over these lower bounds. In this way, actions with potentially higher optimum advantage are sampled more frequently. In our experiments on gridworld environments, we demonstrate that given access to an optimal policy in a related task-environment, ExTra can outperform popular domain-specific exploration strategies viz. epsilon greedy, Model-Based Interval Estimation - Exploration Bonus (MBIE-EB), Pursuit and Boltzmann in rate of convergence. We further show that ExTra is robust to choices of source task and shows a graceful degradation of performance as the dissimilarity of the source task increases. We also demonstrate that ExTra, when used alongside tradi

preprint2013arXiv

Multiple Kernel Learning in the Primal for Multi-modal Alzheimer's Disease Classification

To achieve effective and efficient detection of Alzheimer's disease (AD), many machine learning methods have been introduced into this realm. However, the general case of limited training samples, as well as different feature representations typically makes this problem challenging. In this work, we propose a novel multiple kernel learning framework to combine multi-modal features for AD classification, which is scalable and easy to implement. Contrary to the usual way of solving the problem in the dual space, we look at the optimization from a new perspective. By conducting Fourier transform on the Gaussian kernel, we explicitly compute the mapping function, which leads to a more straightforward solution of the problem in the primal space. Furthermore, we impose the mixed $L_{21}$ norm constraint on the kernel weights, known as the group lasso regularization, to enforce group sparsity among different feature modalities. This actually acts as a role of feature modality selection, while at the same time exploiting complementary information among different kernels. Therefore it is able to extract the most discriminative features for classification. Experiments on the ADNI data se

preprint2009arXiv

Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. "Structure" has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Beginning with the role of number theory in expressing data, we show how we can naturally proceed to hierarchical structures. We show how this both encapsulates traditional paradigms in data analysis, and also opens up new perspectives towards issues that are on the order of the day, including data mining of massive, high dimensional, heterogeneous data sets. Linkages with other fields are also discussed including computational logic and symbolic dynamics. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.

preprint2015arXiv

Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias

A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by letting the corresponding weights be zeros. However, regularization often shrinks the estimator for relevant features, which leads to incorrect feature selection. Motivated by the above-mentioned issue, we propose Bayesian masking (BM), a sparse estimation method which imposes no regularization on the weights. The key concept of BM is to introduce binary latent variables that randomly mask features. Estimating the masking rates determines the relevance of the features automatically. We derive a variational Bayesian inference algorithm that maximizes the lower bound of the factorized information criterion (FIC), which is a recently developed asymptotic criterion for evaluating the marginal log-likelihood. In addition, we propose reparametrization to accelerate the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off.

preprint2026arXiv

Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability

Electroencephalography (EEG) is a cornerstone of brain-computer interfaces and clinical neuroscience, yet deep learning models are typically trained and evaluated under a single, unreported preprocessing pipeline. We formalize preprocessing choices as a counterfactual intervention space and show that EEG predictions are surprisingly unstable under this space: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing changes, a variability that standard uncertainty methods do not explicitly quantify because they condition on a fixed preprocessing pipeline. We provide three tools to make this instability measurable, decomposable, and reducible. First, a Walsh-Hadamard decomposition of the 2^7 pipeline space reveals that sensitivity is near-additive in practice under the binary intervention design, enabling efficient step-by-step optimization. Second, we introduce Preprocessing Uncertainty (PU), a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Third, we study Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer that exploits the compositional structure of preprocessing interventions as one mitigation strategy with clear scope conditions.

preprint2022arXiv

Exploring the stimulative effect on following drivers in a consecutive lane-change using microscopic vehicle trajectory data

Improper lane-changing behaviors may result in breakdown of traffic flow and the occurrence of various types of collisions. This study investigates lane-changing behaviors of multiple vehicles and the stimulative effect on following drivers in a consecutive lane-changing scenario. The microscopic trajectory data from the dataset are used for driving behavior analysis.Two discretionary lane-changing vehicle groups constitute a consecutive lane-changing scenario, and not only distance- and speed-related factors but also driving behaviors are taken into account to examine the impacts on the utility of following lane-changing vehicles.A random parameters logit model is developed to capture the driver psychological heterogeneity in the consecutive lane-changing situation.Furthermore, a lane-changing utility prediction model is established based on three supervised learning algorithms to detect the improper lane-changing decision. Results indicate that (1) the consecutive lane-changing behaviors have a significant negative effect on the following lane-changing vehicles after lane-change; (2) the stimulative effect exists in a consecutive lane-change situation and its influence is heterogeneous due to different psychological activities of drivers; and (3) the utility prediction model can be used to detect an improper lane-changing decision.

preprint2020arXiv

Learning towards Minimum Hyperspherical Energy

Neural networks are a powerful class of nonlinear functions that can be trained end-to-end on various applications. While the over-parametrization nature in many neural networks renders the ability to fit complex functions and the strong representation power to handle challenging tasks, it also leads to highly correlated neurons that can hurt the generalization ability and incur unnecessary computation cost. As a result, how to regularize the network to avoid undesired representation redundancy becomes an important issue. To this end, we draw inspiration from a well-known problem in physics -- Thomson problem, where one seeks to find a state that distributes N electrons on a unit sphere as evenly as possible with minimum potential energy. In light of this intuition, we reduce the redundancy regularization problem to generic energy minimization, and propose a minimum hyperspherical energy (MHE) objective as generic regularization for neural networks. We also propose a few novel variants of MHE, and provide some insights from a theoretical point of view. Finally, we apply neural networks with MHE regularization to several challenging tasks. Extensive experiments demonstrate the effec

preprint2021arXiv

Free-rider Attacks on Model Aggregation in Federated Learning

Free-rider attacks against federated learning consist in dissimulating participation to the federated learning process with the goal of obtaining the final aggregated model without actually contributing with any data. This kind of attacks is critical in sensitive applications of federated learning, where data is scarce and the model has high commercial value. We introduce here the first theoretical and experimental analysis of free-rider attacks on federated learning schemes based on iterative parameters aggregation, such as FedAvg or FedProx, and provide formal guarantees for these attacks to converge to the aggregated models of the fair participants. We first show that a straightforward implementation of this attack can be simply achieved by not updating the local parameters during the iterative federated optimization. As this attack can be detected by adopting simple countermeasures at the server level, we subsequently study more complex disguising schemes based on stochastic updates of the free-rider parameters. We demonstrate the proposed strategies on a number of experimental scenarios, in both iid and non-iid settings. We conclude by providing recommendations to avoid free-rider attacks in real world applications of federated learning, especially in sensitive domains where security of data and models is critical.

preprint2026arXiv

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.