Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
37 works
0 followers
23 topics
4 close collaborators

Published work

37 published item(s)

preprint | 2026 | arXiv

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.

preprint | 2026 | arXiv

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.

preprint | 2025 | arXiv

Universal Battery Degradation Forecasting Driven by Foundation Model Across Diverse Chemistries and Conditions

Accurate forecasting of battery capacity fade is essential for the safety, reliability, and long-term efficiency of energy storage systems. However, the strong heterogeneity across cell chemistries, form factors, and operating conditions makes it difficult to build a single model that generalizes beyond its training domain. This work proposes a unified capacity forecasting framework that maintains robust performance across diverse chemistries and usage scenarios. We curate 20 public aging datasets into a large-scale corpus covering 1,704 cells and 3,961,195 charge-discharge cycle segments, spanning temperatures from $-5\,^{\circ}\mathrm{C}$ to $45\,^{\circ}\mathrm{C}$, multiple C-rates, and application-oriented profiles such as fast charging and partial cycling. On this corpus, we adopt a Time-Series Foundation Model (TSFM) backbone and apply parameter-efficient Low-Rank Adaptation (LoRA) together with physics-guided contrastive representation learning to capture shared degradation patterns. Experiments on both seen and deliberately held-out unseen datasets show that a single unified model achieves competitive or superior accuracy compared with strong per-dataset baselines, while retaining stable performance on chemistries, capacity scales, and operating conditions excluded from training. These results demonstrate the potential of TSFM-based architectures as a scalable and transferable solution for capacity degradation forecasting in real battery management systems.

preprint | 2022 | arXiv

Cohomology algebras of a family of cochain DG skew polynomial algebras

Let $\mathcal{A}$ be a connected cochain DG algebra such that its underlying graded algebra $\mathcal{A}^{\#}$ is the graded skew polynomial algebra $$k\langle x_1,x_2, x_3\rangle/\left(\begin{array}{c} x_1x_2+x_2x_1\\ x_2x_3+x_3x_2\\ x_3x_1+x_1x_3 \end{array}\right), \quad |x_1|=|x_2|=|x_3|=1.$$ From \cite{MWZ} or \cite{MWYZ}, one sees that the differential $\partial_{\mathcal{A}}$ is determined by \begin{align*} \left( \begin{array}{c} \partial_{\mathcal{A}}(x_1) \\ \partial_{\mathcal{A}}(x_2) \\ \partial_{\mathcal{A}}(x_3) \end{array} \right)=M\left( \begin{array}{c} x_1^2 \\ x_2^2 \\ x_3^2 \end{array} \right), \end{align*} for some $M\in M_3(k)$. For the case $1\le r(M)\le 3$, we compute $H(\mathcal{A})$ case by case. The computational results in this paper give substantial support for \cite{MWZ}, where the various homological properties of such DG algebras are systematically studied. We find some examples which indicate that the cohomology graded algebra of a Koszul Calabi-Yau DG algebra may not be left (right) Gorenstein.

preprint | 2022 | arXiv

Converse: A Tree-Based Modular Task-Oriented Dialogue System

Creating a system that can have meaningful conversations with humans to help accomplish tasks is one of the ultimate goals of Artificial Intelligence (AI). It has defined the meaning of AI since the beginning. A lot has been accomplished in this area recently, with voice assistant products entering our daily lives and chat bot systems becoming commonplace in customer service. At first glance there seems to be no shortage of options for dialogue systems. However, the frequently deployed dialogue systems today seem to all struggle with a critical weakness - they are hard to build and harder to maintain. At the core of the struggle is the need to script every single turn of interactions between the bot and the human user. This makes the dialogue systems more difficult to maintain as the tasks become more complex and more tasks are added to the system. In this paper, we propose Converse, a flexible tree-based modular task-oriented dialogue system. Converse uses an and-or tree structure to represent tasks and offers powerful multi-task dialogue management. Converse supports task dependency and task switching, which are unique features compared to other open-source dialogue frameworks. At the same time, Converse aims to make the bot building process easy and simple, for both professional and non-professional software developers. The code is available at https://github.com/salesforce/Converse.

preprint | 2022 | arXiv

Critical behavior in the Mn$_{5}$Ge$_{3}$ ferromagnet

High-Curie-temperature ferromagnets are promising candidates for designing new spintronic devices. Here we have successfully synthesized a single-crystal sample of the itinerant ferromagnet Mn$_{5}$Ge$_{3}$ using the flux method, and its critical properties were investigated by means of bulk dc-magnetization at the boundary between the ferromagnetic (FM) and paramagnetic (PM) phases. Critical exponents $β=0.336 \pm 0.001$ with a critical temperature $T_{c}=300.29 \pm 0.01$ K and $γ=1.193 \pm 0.003$ with $T_{c} = 300.15 \pm 0.05$ K are obtained by the modified Arrott plot, whereas $δ= 4.61 \pm 0.03$ is deduced by a critical isotherm analysis at $T_{c} = 300$ K. The self-consistency and reliability of these critical exponents are verified by the Widom scaling law and the scaling equations. Further analysis reveals that the spin coupling in Mn$_{5}$Ge$_{3}$ exhibits three-dimensional Ising-like behavior. The magnetic exchange is found to decay as $J(r)\approx r^{-4.855}$ and the spin interactions extend beyond the nearest neighbors, which may be related to different sets of Mn-Mn interactions with unequal exchange strengths. Additionally, the existence of noncollinear spin configurations in Mn$_{5}$Ge$_{3}$ results in a small deviation of the obtained critical exponents from those of the standard 3D Ising model.
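
The Widom scaling law cited in this abstract relates the three exponents as δ = 1 + γ/β. The following is a minimal sketch of that consistency check, using only the central values and uncertainties quoted above; the function and variable names are illustrative and not taken from the paper.

```python
# Exponents and uncertainties as quoted in the abstract above.
beta = 0.336
gamma = 1.193
delta_measured = 4.61  # from the critical isotherm analysis


def widom_delta(beta: float, gamma: float) -> float:
    """Widom scaling relation: delta = 1 + gamma / beta."""
    return 1.0 + gamma / beta


# Value of delta implied by beta and gamma alone.
delta_widom = widom_delta(beta, gamma)

# Relative difference between the implied and the measured value.
relative_gap = abs(delta_widom - delta_measured) / delta_measured

print(f"delta from Widom relation:    {delta_widom:.3f}")
print(f"delta from critical isotherm: {delta_measured:.2f}")
print(f"relative difference:          {relative_gap:.1%}")
```

The implied value is about 4.55, within roughly 1.3% of the measured 4.61, which is the sense in which the exponents can be called self-consistent.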

preprint | 2022 | arXiv

Design of Core-Shell Structured Magnetic Microwires with Desirable Properties for Multifunctional Applications

Amorphous Co-rich microwires with excellent soft magnetic and mechanical properties produced by the melt-extraction technique are emerging as a multifunctional material for a variety of applications ranging from ultrasensitive magnetic field sensors to structural health self-monitoring composites. There is a pressing need for enhancing these properties to make the microwires practical for integration into new technologies. Conventional heat treatments at temperatures below the crystallization temperature may improve the magnetic softness of an as-quenched amorphous wire, but usually deteriorate the good mechanical characteristics of the wire due to crystallization. To overcome this, we propose a new approach that utilizes the advantages of a multi-step Joule current annealing method to design novel (nanocrystal, amorphous)/amorphous core/shell structures directly from as-quenched amorphous microwires. These results show that the density and size of nanocrystals in the core can be optimized by controlling the Joule current intensity, resulting in a large enhancement of soft magnetic and giant magneto-impedance properties, while the amorphous shell preserves the excellent mechanical strength of the microwire. This study also provides a new pathway for the design of novel core/shell structures directly from rapidly quenched amorphous magnetic materials that are currently exploited in high frequency transformers, sensing and cooling devices.

preprint | 2022 | arXiv

Dual Lottery Ticket Hypothesis

Fully exploiting the learning capacity of neural networks requires overparameterized dense networks. On the other hand, directly training sparse neural networks typically results in unsatisfactory performance. The Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity. Concretely, it claims there exist winning tickets from a randomly initialized network, found by iterative magnitude pruning, that preserve promising trainability (that is, they are in a trainable condition). In this work, we regard the winning ticket from LTH as the subnetwork which is in trainable condition and its performance as our benchmark, then go from a complementary direction to articulate the Dual Lottery Ticket Hypothesis (DLTH): Randomly selected subnetworks from a randomly initialized dense network can be transformed into a trainable condition and achieve admirable performance compared with LTH -- random tickets in a given lottery pool can be transformed into winning tickets. Specifically, by using uniform-randomly selected subnetworks to represent the general cases, we propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our DLTH. Concretely, we introduce a regularization term to borrow learning capacity and realize information extrusion from the weights which will be masked. After finishing the transformation for the randomly selected subnetworks, we conduct the regular finetuning to evaluate the model using fair comparisons with LTH and other strong baselines. Extensive experiments on several public datasets and comparisons with competitive approaches validate our DLTH as well as the effectiveness of the proposed model RST. Our work is expected to pave a way for inspiring new research directions of sparse network training in the future. Our code is available at https://github.com/yueb17/DLTH.

preprint | 2022 | arXiv

Efficient and Differentiable Conformal Prediction with General Function Classes

Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.
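
For context, the "single parameter" learned by the default conformalization step can be illustrated with plain split conformal prediction. This is a minimal sketch of that standard baseline, not of the paper's multi-parameter method; the synthetic data, the fixed predictor `predict`, and all names are assumptions made for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def predict(x):
    # Hypothetical fixed point predictor standing in for a trained model.
    return 2.0 * x


def sample(n):
    # Synthetic data (assumption): y = 2x + Gaussian noise.
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]


random.seed(0)
calibration = sample(500)
test_points = sample(2000)
alpha = 0.1

# Conformity score: absolute residual on the held-out calibration split.
scores = [abs(y - predict(x)) for x, y in calibration]

# The single learned parameter: the interval half-width t.
t = conformal_quantile(scores, alpha)

# Prediction interval for x is [predict(x) - t, predict(x) + t].
covered = sum(abs(y - predict(x)) <= t for x, y in test_points)
coverage = covered / len(test_points)

print(f"half-width t = {t:.3f}, empirical coverage = {coverage:.3f}")
```

Here every input receives the same interval width, which is the efficiency limitation that motivates optimizing over a richer function class.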

preprint | 2022 | arXiv

Generating Negative Samples for Sequential Recommendation

To make Sequential Recommendation (SR) successful, recent works focus on designing effective sequential encoders, fusing side information, and mining extra positive self-supervision signals. The strategy of sampling negative items at each time step is less explored. Due to the dynamics of users' interests and model updates during training, considering randomly sampled items from a user's non-interacted item set as negatives can be uninformative. As a result, the model will inaccurately learn user preferences toward items. Identifying informative negatives is challenging because informative negative items are tied with both dynamically changed interests and model parameters (and sampling process should also be efficient). To this end, we propose to Generate Negative Samples (items) for SR (GenNi). A negative item is sampled at each time step based on the current SR model's learned user preferences toward items. An efficient implementation is proposed to further accelerate the generation process, making it scalable to large-scale recommendation tasks. Extensive experiments on four public datasets verify the importance of providing high-quality negative samples for SR and demonstrate the effectiveness and efficiency of GenNi.

preprint | 2022 | arXiv

Large anomalous Hall effect in layered antiferromagnet Co$_{0.29}$TaS$_2$

We present a study on the magnetization, anomalous Hall effect (AHE) and novel longitudinal resistivity in layered antiferromagnet Co$_{0.29}$TaS$_{2}$. Of particular interest in Co$_{0.29}$TaS$_{2}$ are the abundant magnetic transitions, which show that the magnetic structures are tuned by temperature or magnetic field. With decreasing temperature, Co$_{0.29}$TaS$_{2}$ undergoes two transitions at T$_{t1}\sim$ 38.3 K and T$_{t2}\sim$ 24.3 K. Once the magnetic field is applied, another transition T$_{t3}\sim$ 34.3 K appears between 0.3 T and 5 T. At 2 K, an obvious ferromagnetic hysteresis loop within H$_{t1}\sim\pm$ 6.9 T is observed, which decreases with increasing temperature and eventually disappears at T$_{t2}$. Besides, Co$_{0.29}$TaS$_{2}$ displays step-like behavior as another magnetic transition around H$_{t2}\sim\pm$ 4 T, which exists until $\sim$ T$_{t1}$. These characteristic temperatures and magnetic fields mark complex magnetic phase transitions in Co$_{0.29}$TaS$_{2}$, which are also evidenced in transport results. A large AHE dominates the Hall resistivity, with a conspicuous value of R$_{s}$/R$_{0}\sim 10^{5}$. Since the tiny net magnetization (0.0094$μ_{B}$/Co) alone would not lead to this value, a contribution from Berry curvature is necessary. The longitudinal resistivity shows a prominent irreversible behavior within H$_{t1}$. The abrupt change at H$_{t2}$ below T$_{t1}$, corresponding to the step-like magnetic transitions, is also observed. Synergy between the magnetism and topological properties, both playing a crucial role, may be the key factor behind the large AHE in antiferromagnets, which also offers a new perspective on magnetic topological materials with the platform of Co$_{0.29}$TaS$_{2}$.

preprint | 2022 | arXiv

Local Calibration: Metrics and Recalibration

Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel local recalibration method LoRe, to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.
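
To make the gap between average and finer-grained reliability concrete, here is a minimal sketch of the standard equal-width-bin ECE, applied to a toy population in which two subgroups are each miscalibrated but cancel on average. It does not implement the paper's kernel-based LCE; the subgroup split and all names are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: sum_b (|B_b| / n) * |acc_b - conf_b|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data (assumption): every prediction has confidence 0.8.
# Subgroup A is always right; subgroup B is right 3 times out of 5.
conf_a, correct_a = [0.8] * 5, [1, 1, 1, 1, 1]
conf_b, correct_b = [0.8] * 5, [1, 1, 1, 0, 0]

ece_all = expected_calibration_error(conf_a + conf_b, correct_a + correct_b)
ece_a = expected_calibration_error(conf_a, correct_a)
ece_b = expected_calibration_error(conf_b, correct_b)

print(f"ECE over everyone: {ece_all:.3f}")
print(f"ECE on subgroup A: {ece_a:.3f}")
print(f"ECE on subgroup B: {ece_b:.3f}")
```

The population-level ECE is essentially zero while each subgroup is off by 0.2, which is the kind of miscalibration a purely average metric cannot detect.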

preprint | 2022 | arXiv

Local cohomology for Gorenstein homologically smooth DG algebras

In this paper, we introduce the theory of local cohomology and local duality to Noetherian connected cochain DG algebras. We show that the notion of local cohomology functor can be used to detect the Gorensteinness of a homologically smooth DG algebra. For any Gorenstein homologically smooth locally finite DG algebra $\mathcal{A}$, we define a group homomorphism $\mathrm{Hdet}: \mathrm{Aut}_{dg}(\mathcal{A})\to k^{\times},$ called the homological determinant. As applications, we present a sufficient condition for the invariant DG subalgebra $\mathcal{A}^G$ to be Gorenstein, where $\mathcal{A}$ is a homologically smooth DG algebra such that $H(\mathcal{A})$ is a Noetherian AS-Gorenstein graded algebra and $G$ is a finite subgroup of $\mathrm{Aut}_{dg}(\mathcal{A})$. In particular, we can apply this result to DG down-up algebras and non-trivial DG free algebras generated by two degree-one elements.

preprint | 2022 | arXiv

Moiré Engineering and Topological Flat Bands in Twisted Orbital-Active Bilayers

Topological flat bands at the Fermi level offer a promising platform to study a variety of intriguing correlated phases of matter. Here we present band engineering in twisted orbital-active bilayers with spin-orbit coupling. The symmetry constraints on the interlayer coupling that determines the effective potential for low-energy physics of moiré electrons are exhaustively derived for two-dimensional point groups. We find that the line graph or bipartite sublattice of the moiré pattern emerges with a minimal $C_3$ symmetry and exhibits isolated electronic flat bands with nontrivial topology. The band flatness is insensitive to the twist angle since the flat bands come from the interference effect. Armed with this guiding principle, we predict that twisted bilayers of 2H-PbS$_2$ and CdS realize the salient physics to engineer two-dimensional topological quantum phases. At small twist angles, PbS$_2$ heterostructures give rise to an emergent moiré Kagomé lattice, while CdS heterostructures lead to an emergent moiré honeycomb lattice, and both of them host moiré quantum spin Hall insulators with almost flat topological bands. We further study superconductivity of these two systems with local attractive interactions. The superfluid weight and Berezinskii-Kosterlitz-Thouless temperature are determined by multiband processes and quantum geometry of the band in the flat-band limit when the pairing potential exceeds the band width. Our results demonstrate twisted bilayers with multi-orbitals as a promising tunable platform to realize correlated topological phases.

preprint | 2022 | arXiv

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $μ$ close to the optimal policy $π_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $μ$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $μ$ and $π_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $Ω(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $μ$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $μ$ only satisfies concentrability partially up to a certain time step.
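
The "either offline reduction or purely online RL" conclusion can be read off by placing the bounds side by side. The first and third lines below are taken from the abstract; the second line is an assumption, namely that a purely online RL algorithm needs on the order of $H^3SA/\varepsilon^2$ episodes, which the abstract implies but does not state.

```latex
\begin{align*}
\text{offline reduction (upper bound):}\quad
  & \widetilde{O}\!\left(\frac{H^3 S\, C^\star}{\varepsilon^2}\right),\\
\text{purely online RL (assumed upper bound):}\quad
  & \widetilde{O}\!\left(\frac{H^3 S\, A}{\varepsilon^2}\right),\\
\text{any policy finetuning algorithm (lower bound):}\quad
  & \Omega\!\left(\frac{H^3 S \min\{C^\star, A\}}{\varepsilon^2}\right).
\end{align*}
```

Under that assumption, when $C^\star \le A$ the offline reduction matches the lower bound up to logarithmic factors, and when $C^\star > A$ the purely online algorithm does, so no method that mixes the two can improve the rate in the general setting.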

preprint | 2022 | arXiv

Privacy-Preserving Face Recognition with Learnable Privacy Budgets in Frequency Domain

Face recognition technology has been used in many fields due to its high recognition accuracy, including the face unlocking of mobile devices, community access control systems, and city surveillance. As the current high accuracy is guaranteed by very deep network structures, facial images often need to be transmitted to third-party servers with high computational power for inference. However, facial images visually reveal the user's identity information. In this process, both untrusted service providers and malicious users can significantly increase the risk of a personal privacy breach. Current privacy-preserving approaches to face recognition are often accompanied by many side effects, such as a significant increase in inference time or a noticeable decrease in recognition accuracy. This paper proposes a privacy-preserving face recognition method using differential privacy in the frequency domain. Due to the utilization of differential privacy, it offers a guarantee of privacy in theory. Meanwhile, the loss of accuracy is very slight. This method first converts the original image to the frequency domain and removes the direct component termed DC. Then a privacy budget allocation method can be learned based on the loss of the back-end face recognition network within the differential privacy framework. Finally, it adds the corresponding noise to the frequency domain features. Our method performs very well with several classical face recognition test sets according to the extensive experiments.

preprint | 2022 | arXiv

Recent Advances on Neural Network Pruning at Initialization

Neural network pruning typically removes connections or neurons from a pretrained converged model, whereas a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning paradigm. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to pruning after training (PaT) and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.

preprint | 2022 | arXiv

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. \textsc{Adam}, have achieved tremendous success in machine learning. By scaling the learning rate element-wise with a certain form of second moment estimate of gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization ability compared with stochastic gradient descent (\textsc{SGD}) and tend to be trapped in local minima at an early stage during training. Intriguingly, we discover that substituting the gradient in the second raw moment estimate term with its momentumized version in \textsc{Adam} can resolve the issue. The intuition is that the gradient with momentum contains more accurate directional information and therefore its second moment estimation is a more favorable option for learning rate scaling than that of the raw gradient. Thereby we propose \textsc{AdaMomentum} as a new optimizer reaching the goal of training fast while generalizing much better. We further develop a theory to back up the improvement in generalization and provide convergence guarantees under both convex and nonconvex settings. Extensive experiments on a wide range of tasks and models demonstrate that \textsc{AdaMomentum} exhibits state-of-the-art performance and superior training stability consistently.
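
The substitution described in this abstract can be sketched on a scalar parameter. This is a rough illustration under stated assumptions, not the authors' implementation: the bias correction, the placement of `eps`, the hyperparameter values, and the names `adam_like_step` and `run` are all choices made here, and the only change between the two variants is which quantity is squared in the second-moment estimate.

```python
import math


def adam_like_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999,
                   eps=1e-8, momentum_in_second_moment=True):
    """One Adam-style update on a scalar parameter.

    With momentum_in_second_moment=True the second-moment estimate tracks
    m_t ** 2 (the momentumized gradient); with False it tracks grad ** 2,
    as in standard Adam.
    """
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    source = state["m"] if momentum_in_second_moment else grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * source ** 2
    m_hat = state["m"] / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** t)  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def run(steps, momentum_in_second_moment, x0=5.0):
    """Minimize f(x) = x ** 2 (gradient 2x) from x0 for a number of steps."""
    x = x0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        x = adam_like_step(x, 2.0 * x, state,
                           momentum_in_second_moment=momentum_in_second_moment)
    return x


for flag, label in [(False, "squared gradient"), (True, "squared momentum")]:
    print(f"{label:>17}: x after 1 step = {run(1, flag):.4f}, "
          f"after 2000 steps = {run(2000, flag):.4f}")
```

One property of this particular sketch is that the very first step of the momentum variant is about 1/(1 - beta1) times larger than Adam's, because the first-moment estimate starts at (1 - beta1) times the gradient before bias correction.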

preprint | 2022 | arXiv

STN: Scalable Tensorizing Networks via Structure-Aware Training and Adaptive Compression

Deep neural networks (DNNs) have delivered remarkable performance in many computer vision tasks. However, over-parameterized representations of popular architectures dramatically increase their computational complexity and storage costs, and hinder their availability on edge devices with constrained resources. Although many tensor decomposition (TD) methods have been well studied for compressing DNNs to learn compact representations, they suffer from non-negligible performance degradation in practice. In this paper, we propose Scalable Tensorizing Networks (STN), which dynamically and adaptively adjust the model size and decomposition structure without retraining. First, we account for compression during training by adding a low-rank regularizer to guarantee networks' desired low-rank characteristics in full tensor format. Then, considering that network layers exhibit various low-rank structures, STN is obtained by a data-driven adaptive TD approach, for which the topological structure of decomposition per layer is learned from the pre-trained model, and the ranks are selected appropriately under specified storage constraints. As a result, STN is compatible with arbitrary network architectures and achieves higher compression performance and flexibility than other tensorizing versions. Comprehensive experiments on several popular architectures and benchmarks substantiate the superiority of our model in improving parameter efficiency.

preprint | 2021 | arXiv

Automatic Segmentation of Organs-at-Risk from Head-and-Neck CT using Separable Convolutional Neural Network with Hard-Region-Weighted Loss

Nasopharyngeal Carcinoma (NPC) is a leading form of Head-and-Neck (HAN) cancer in the Arctic, China, Southeast Asia, and the Middle East/North Africa. Accurate segmentation of Organs-at-Risk (OAR) from Computed Tomography (CT) images with uncertainty information is critical for effective planning of radiation therapy for NPC treatment. Despite the state-of-the-art performance achieved by Convolutional Neural Networks (CNNs) for automatic segmentation of OARs, existing methods do not provide uncertainty estimation of the segmentation results for treatment planning, and their accuracy is still limited by several factors, including the low contrast of soft tissues in CT, highly imbalanced sizes of OARs and large inter-slice spacing. To address these problems, we propose a novel framework for accurate OAR segmentation with reliable uncertainty estimation. First, we propose a Segmental Linear Function (SLF) to transform the intensity of CT images to make multiple organs more distinguishable than existing methods based on a simple window width/level, which often gives better visibility of one organ while hiding the others. Second, to deal with the large inter-slice spacing, we introduce a novel 2.5D network (named 3D-SepNet) specially designed for clinical HAN CT scans with anisotropic spacing. Third, existing hardness-aware loss functions often deal with class-level hardness, but our proposed attention to hard voxels (ATH) uses a voxel-level hardness strategy, which is more suitable for dealing with hard regions even when their corresponding class is easy. Our code is now available at https://github.com/HiLab-git/SepNet.

preprint | 2021 | arXiv

Contradictory Structure Learning for Semi-supervised Domain Adaptation

Current adversarial adaptation methods attempt to align the cross-domain features, whereas two challenges remain unsolved: 1) the conditional distribution mismatch and 2) the bias of the decision boundary towards the source domain. To solve these challenges, we propose a novel framework for semi-supervised domain adaptation by unifying the learning of opposite structures (UODA). UODA consists of a generator and two classifiers (i.e., the source-scattering classifier and the target-clustering classifier), which are trained for contradictory purposes. The target-clustering classifier attempts to cluster the target features to improve intra-class density and enlarge inter-class divergence. Meanwhile, the source-scattering classifier is designed to scatter the source features to enhance the decision boundary's smoothness. Through the alternation of source-feature expansion and target-feature clustering procedures, the target features are well-enclosed within the dilated boundary of the corresponding source features. This strategy aligns the cross-domain features precisely while simultaneously counteracting the source bias. Moreover, to overcome model collapse during training, we progressively update the measurement of the features' distance and their representation via an adversarial training paradigm. Extensive experiments on the DomainNet and Office-home benchmark datasets demonstrate the superiority of our approach over the state-of-the-art methods.

preprint2021arXiv

How Important is the Train-Validation Split in Meta-Learning?

Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct \emph{train-train method}, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta testing, and this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a \emph{strictly better} excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.

preprint2021arXiv

Magnetic moiré surface states and flat chern band in topological insulators

We theoretically study the effect of a magnetic moiré superlattice on the topological surface states by introducing a continuum model of Dirac electrons with a single Dirac cone moving in a time-reversal-symmetry-breaking periodic potential. The Zeeman-type moiré potentials generically gap out the moiré surface Dirac cones and give rise to isolated flat Chern minibands with Chern number $\pm1$. This result provides a promising platform for realizing the time-reversal breaking correlated topological phases. In a $C_6$ periodic potential, when the scalar $U_0$ and Zeeman $Δ_1$ moiré potential strengths are equal to each other, we find that energetically the first three bands of $Γ$-valley moiré surface electrons are non-degenerate and realize i) an $s$-orbital model on a honeycomb lattice, ii) a degenerate $p_x,p_y$-orbitals model on a honeycomb lattice, and iii) a hybridized $sd^2$-orbital model on a kagome lattice, where moiré surface Dirac cones in these bands emerge. When $U_0\neqΔ_1$, the difference between the two moiré potentials serves as an effective spin-orbit coupling and opens a topological gap in the emergent moiré surface Dirac cones.

preprint2021arXiv

Manipulating Goldstone modes via the superradiant light in a bosonic lattice gas inside a cavity

We study the low-energy excitations of a bosonic lattice gas with cavity-mediated interactions. By performing two successive Hubbard-Stratonovich transformations, we derive an effective field theory to study the strong-coupling regime. Taking quantum fluctuations into account, we report an unusual effect of the density imbalance induced by the superradiant cavity light, which previous studies have shown to have a negligible effect on the single-particle excitation. Instead, we show that this seemingly negligible fluctuation of the density imbalance dramatically changes the behavior of the low-energy excitation and results in free switching between two types of Goldstone modes in its single-particle excitation, i.e., type I and type II with odd and even power energy-momentum dispersion, respectively. Our proposal would open a new horizon for manipulating Goldstone modes by bridging cavity light and strongly interacting quantum matter.

preprint2021arXiv

Towards Understanding Hierarchical Learning: Benefits of Neural Representations

Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations is not explained by recent theories that relate them to "shallow learners" such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$. We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite width limit), when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.

preprint2021arXiv

Use of NMR to Test Molecular Mobility during Chemical Reaction

We critically evaluate the use of pulsed gradient spin-echo nuclear magnetic resonance (PGSE NMR) to measure molecular mobility during chemical reactions. With raw NMR spectra available in a public depository, we confirm boosted mobility during the click chemical reaction (Science 2020, 369, 537) regardless of the order of magnetic field gradient (linearly-increasing, linearly-decreasing, random sequence). We also confirm boosted mobility for the Diels-Alder chemical reaction. The conceptual advantage of the former chemical system is that constant reaction rate implies constant catalyst concentration, whereas that of the latter is the absence of a paramagnetic catalyst, precluding paramagnetism as an objection to the measurements. Data and discussion in this paper show the reliability of experiments when one avoids convection, allows decay of nuclear spin magnetization between successive pulses and recovery of its intensity between gradients, and satisfies quasi-steady state during the time window to acquire each datum. It is especially important to make comparisons on the time scale of the actual chemical reaction kinetics. We discuss possible sources of mistaken conclusions that should be avoided.

preprint2020arXiv

Collaborative Distillation for Ultra-Resolution Universal Style Transfer

Universal style transfer methods typically leverage rich representations from deep Convolutional Neural Network (CNN) models (e.g., VGG-19) pre-trained on large collections of images. Despite their effectiveness, their application to ultra-resolution images is heavily constrained by the large model size given limited memory. In this work, we present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models. Moreover, to overcome the feature size mismatch when applying collaborative distillation, a linear embedding loss is introduced to drive the student network to learn a linear embedding of the teacher's features. Extensive experiments show the effectiveness of our method when applied to different universal style transfer approaches (WCT and AdaIN), even if the model size is reduced by 15.5 times. In particular, on WCT with the compressed models, we achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time. Further experiments on an optimization-based stylization scheme show the generality of our algorithm on different stylization paradigms. Our code and trained models are available at https://github.com/mingsun-tse/collaborative-distillation.

preprint2020arXiv

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources required to train such models remain an issue. Knowledge distillation alleviates this problem by learning a light-weight student model. So far, distillation approaches have all been task-specific. In this paper, we explore knowledge distillation under the multi-task learning setting. The student is jointly distilled across different tasks. It acquires more general representation capacity through multi-task distillation and can be further fine-tuned to improve the model in the target domain. Unlike other BERT distillation methods, which are specifically designed for Transformer-based architectures, we provide a general learning framework. Our approach is model agnostic and can be easily applied to different future teacher model architectures. We evaluate our approach on a Transformer-based and an LSTM-based student model. Compared to a strong, similarly LSTM-based approach, we achieve better quality under the same computational constraints. Compared to the present state of the art, we reach comparable results with much faster inference speed.
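The abstract above does not state which distillation objective is used, so the following is only a minimal sketch of joint distillation across tasks under one common assumption: a temperature-softened KL divergence between teacher and student outputs, averaged over tasks. The task names, logits and function names are hypothetical and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def multitask_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Average KL(teacher || student) over tasks; each dict maps a task name to its logits."""
    losses = []
    for task, t_logits in teacher_logits.items():
        p = softmax(t_logits, temperature)               # softened teacher distribution
        q = softmax(student_logits[task], temperature)   # softened student distribution
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical tasks and logits for a single example per task.
    teacher = {"sentiment": [2.0, -1.0], "nli": [0.5, 1.5, -0.5]}
    student = {"sentiment": [1.0, 0.0], "nli": [0.2, 1.0, 0.1]}
    print(multitask_distillation_loss(teacher, student))
```

Because the loss only touches output logits, the same function applies whether the student is a Transformer or an LSTM, which is the sense in which such an objective is model agnostic.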

preprint2020arXiv

MNN: A Universal and Efficient Inference Engine

Deploying deep learning models on mobile devices has drawn increasing attention recently. However, designing an efficient on-device inference engine faces the major challenges of model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. In this paper, the contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing a backend abstraction module which enables hybrid scheduling and keeps the engine lightweight. Extensive benchmark experiments demonstrate that MNN performs favorably against other popular lightweight deep learning frameworks. MNN is available to the public at: https://github.com/alibaba/MNN.

preprint2020arXiv

Neural Bayes: A Generic Parameterization Method for Unsupervised Representation Learning

We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable $\mathbf{x}$ and a latent discrete variable $z$, we can express $p(\mathbf{x}|z)$, $p(z|\mathbf{x})$ and $p(z)$ in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization: 1. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables $\mathbf{x}$ and latent discrete random variables $z$ in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks. 2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. Our code is available at https://github.com/salesforce/NeuralBayes

preprint2020arXiv

PLVER: Joint Stable Allocation and Content Replication for Edge-assisted Live Video Delivery

Live streaming services have gained extreme popularity in recent years. Due to the spiky traffic patterns of live videos, utilizing distributed edge servers to improve viewers' quality of experience (QoE) has become a common practice. Nevertheless, the current client-driven content caching mechanism does not support caching beforehand from the cloud to the edge, resulting in considerable cache misses in live video delivery. State-of-the-art research generally sacrifices the liveness of delivered videos in order to deal with this problem. In this paper, by jointly considering the features of live videos and edge servers, we propose PLVER, a proactive live video push scheme to resolve the cache miss problem in live video delivery. Specifically, PLVER first conducts a one-to-multiple stable allocation between edge clusters and user groups, to balance the load of live traffic over the edge servers. Then it adopts proactive video replication algorithms to speed up the video replication among the edge servers. We conduct extensive trace-driven evaluations, covering 0.3 million Twitch viewers and more than 300 Twitch channels. The results demonstrate that with PLVER, edge servers can carry 28% and 82% more traffic than the auction-based replication method and the caching on requested time method, respectively.
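The one-to-multiple stable allocation step mentioned in the abstract above can be illustrated with the classic capacity-constrained deferred-acceptance procedure. This is a generic sketch, not PLVER's actual algorithm: it assumes that each edge cluster serves several user groups up to a capacity, that both sides hold complete preference lists, and all names and preference data are hypothetical.

```python
def stable_allocation(group_prefs, cluster_prefs, capacity):
    """Deferred acceptance: user groups propose, edge clusters keep their best-ranked proposers.

    Assumes every cluster ranks every group that may propose to it.
    """
    # rank[c][g] is the position of group g in cluster c's preference list (lower is better).
    rank = {c: {g: i for i, g in enumerate(prefs)} for c, prefs in cluster_prefs.items()}
    next_choice = {g: 0 for g in group_prefs}   # index of the next cluster each group will try
    assigned = {c: [] for c in cluster_prefs}
    free = list(group_prefs)
    while free:
        g = free.pop()
        if next_choice[g] >= len(group_prefs[g]):
            continue  # g has exhausted its list and stays unassigned
        c = group_prefs[g][next_choice[g]]
        next_choice[g] += 1
        assigned[c].append(g)
        if len(assigned[c]) > capacity[c]:
            # Over capacity: reject the proposer that this cluster ranks lowest.
            worst = max(assigned[c], key=lambda x: rank[c][x])
            assigned[c].remove(worst)
            free.append(worst)
    return assigned


if __name__ == "__main__":
    # Hypothetical preferences: three user groups, two edge clusters.
    group_prefs = {"g1": ["e1", "e2"], "g2": ["e1", "e2"], "g3": ["e1", "e2"]}
    cluster_prefs = {"e1": ["g3", "g1", "g2"], "e2": ["g1", "g2", "g3"]}
    capacity = {"e1": 2, "e2": 1}
    print(stable_allocation(group_prefs, cluster_prefs, capacity))
```

The result is stable in the usual sense: no user group and edge cluster would both prefer each other over their current assignment, which is what keeps the load balance from being undone by unilateral switches.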

preprint2020arXiv

Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width

We propose \emph{Taylorized training} as an initiative towards better understanding neural network training at finite width. Taylorized training involves training the $k$-th order Taylor expansion of the neural network at initialization, and is a principled extension of linearized training---a recently proposed theory for understanding the success of deep learning. We experiment with Taylorized training on modern neural network architectures, and show that Taylorized training (1) agrees with full neural network training increasingly better as we increase $k$, and (2) can significantly close the performance gap between linearized and full training. Compared with linearized training, higher-order training works in more realistic settings such as standard parameterization and large (initial) learning rate. We complement our experiments with theoretical results showing that the approximation error of $k$-th order Taylorized models decays exponentially over $k$ in wide neural networks.

preprint2019arXiv

Anharmonicity Induced Supersolidity In Spin-Orbit Coupled Bose-Einstein Condensates

The supersolid, a fascinating quantum state of matter, features novel phenomena such as non-classical rotational inertia and transport anomalies. The coexistence of superfluidity and broken translational symmetry is a long-standing issue in condensed matter physics. With recent experimental advances in creating tunable synthetic spin-orbit coupling in ultracold gases, such highly controllable atomic systems provide new possibilities to access supersolidity with no counterpart in solids. Here we report that the combination of anharmonicity of the trapping potential and spin-orbit coupling provides a new paradigm to achieve supersolids. By means of imaginary time evolution of the Gross-Pitaevskii equation, we demonstrate that a supersolid state can be found for trapped Rashba-type spin-orbit coupled bosonic atoms loaded in a one-dimensional optical lattice. Furthermore, a skyrmion-anti-skyrmion lattice is associated with the appearance of such supersolidity, indicating the topologically nontrivial properties of our proposed supersolids.

preprint2019arXiv

Cohomology dimension growth for Nakano $q$-semipositive line bundles

We study the cohomology with high tensor powers of Nakano $q$-semipositive line bundles on complex manifolds. We obtain asymptotic estimates for the dimension of cohomology with high tensor powers of semipositive line bundles over $q$-convex manifolds and various possibly non-compact complex manifolds, in which the order of the estimates is optimal. In addition, estimates for the modified Dirac operator on Nakano $q$-positive line bundles on almost complex manifolds are given.

preprint2019arXiv

Quantum oscillations and electronic structures in large Chern number semimetal RhSn

We report the magnetoresistance, Hall effect, de Haas-van Alphen (dHvA) oscillations and the electronic structures of single crystal RhSn, which is a typical material of the CoSi family holding a large Chern number. A large unsaturated magnetoresistance is observed with B//[001]. The Hall resistivity curve indicates that RhSn is a multi-band system with high mobility. Evident quantum oscillations have been observed, from which light effective masses are extracted. Ten fundamental frequencies are extracted after the fast Fourier transform analysis of the dHvA oscillations in the B//[001] configuration. The two low frequencies F$_1$ and F$_2$ do not change obviously and the two high frequencies F$_9$ and F$_{10}$ evolve into four when B rotates from B//[001] to B//[110], which is consistent with the band structure in the first-principles calculations with spin-orbit coupling (SOC). The extracted Berry phases of the relative pockets show a good agreement with the Chern number $\pm4$ (with SOC) in the first-principles calculations. Above all, our studies indicate that RhSn is an ideal platform to study the unconventional chiral fermions and the surface states.

preprint2018arXiv

The growth of dimension of cohomology of semipositive line bundles on Hermitian manifolds

In this paper, we study the dimension of cohomology of semipositive line bundles over Hermitian manifolds, and obtain an asymptotic estimate for the dimension of the space of harmonic $(0,q)$-forms with values in high tensor powers of a semipositive line bundle when the fundamental estimate holds. As applications, we estimate the dimension of cohomology of semipositive line bundles on $q$-convex manifolds, pseudo-convex domains, weakly $1$-complete manifolds and complete manifolds. We also obtain the estimate of cohomology on compact manifolds with semipositive line bundles endowed with a Hermitian metric with analytic singularities and the related vanishing theorems.

preprint2015arXiv

On the growth of von Neumann dimension of harmonic spaces of semipositive line bundles over covering manifolds

We study the harmonic space of line bundle valued forms over a covering manifold with a discrete group action $Γ$, and obtain an asymptotic estimate for the $Γ$-dimension of the harmonic space with respect to the tensor power $k$ in the holomorphic line bundle $L^{k}\otimes E$ and the type $(n,q)$ of the differential form, when $L$ is semipositive. In particular, we estimate the $Γ$-dimension of the corresponding reduced $L^2$-Dolbeault cohomology group. Essentially, we obtain a local estimate of the pointwise norm of harmonic forms with values in semipositive line bundles over Hermitian manifolds.