Source author record

Yifan Sun

Yifan Sun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

34works

22topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Code as Agent Harness

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

preprint2026arXiv

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

Large Language Models (LLMs) are prone to factual hallucinations, risking their reliability in real-world applications. Existing hallucination detectors mainly extract micro-level intrinsic patterns for uncertainty quantification or elicit macro-level self-judgments through verbalized prompts. However, these methods address only a single facet of the hallucination, focusing either on implicit neural uncertainty or explicit symbolic reasoning, thereby treating these inherently coupled behaviors in isolation and failing to exploit their interdependence for a holistic view. In this paper, we propose LaaB (Logical Consistency-as-a-Bridge), a framework that bridges neural features and symbolic judgments for hallucination detection. LaaB introduces a "meta-judgment" process to map symbolic labels back into the feature space. By leveraging the inherent logical bridge where response and meta-judgment labels are either the same or opposite based on the self-judgment's semantics, LaaB aligns and integrates dual-view signals via mutual learning and enhances the hallucination detection. Extensive experiments on 4 public datasets, across 4 LLMs, against 8 baselines demonstrate the superiority of LaaB.

preprint2026arXiv

SPARK: Safe Protective and Assistive Robot Kit

This paper introduces the Safe Protective and Assistive Robot Kit (SPARK), a comprehensive benchmark designed to ensure safety in humanoid autonomy and teleoperation. Humanoid robots pose significant safety risks due to their physical capabilities of interacting with complex environments. The physical structures of humanoid robots further add complexity to the design of general safety solutions. To facilitate safe deployment of complex robot systems, SPARK can be used as a toolbox that comes with state-of-the-art safe control algorithms in a modular and composable robot control framework. Users can easily configure safety criteria and sensitivity levels to optimize the balance between safety and performance. To accelerate humanoid safety research and development, SPARK provides simulation benchmarks that compare safety approaches in a variety of environments, tasks, and robot models. Furthermore, SPARK allows quick deployment of synthesized safe controllers on real robots. For hardware deployment, SPARK supports Apple Vision Pro (AVP) or a Motion Capture System as external sensors, while offering interfaces for seamless integration with alternative hardware setups at the same time. This paper demonstrates SPARK's capability with both simulation experiments and case studies with a Unitree G1 humanoid robot. Leveraging these advantages of SPARK, users and researchers can significantly improve the safety of their humanoid systems as well as accelerate relevant research. The open source code is available at: https://github.com/intelligent-control-lab/spark.

preprint2026arXiv

StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models

Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48\% improvement in accuracy and up to 20--100X faster inference than diffusion-based methods.

preprint2026arXiv

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient multi-domain adaptation by communicating compact class prototypes, but directly sharing them poses privacy risks. A common defense involves per-example $\ell_2$ clipping before prototype computation to bound sensitivity, followed by isotropic Gaussian noise to enforce Local Differential Privacy (LDP). However, Isotropic Gaussian Prototype Perturbation (IGPP) typically over-perturbs discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. In this paper, we propose VPDR, a client-side privacy plug-in that seamlessly integrates into existing ProtoPFLs. Motivated by the observation that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which allocates less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further develop Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise mechanism provides privacy guarantees no weaker than the isotropic baseline under the same privacy constraints. Extensive experiments on multi-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning without sacrificing robustness against realistic attacks.

preprint2026arXiv

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on different country asset selection demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.

preprint2024arXiv

MS-DETR: Efficient DETR Training with Mixed Supervision

DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object. The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates. We aim at improving the DETR training efficiency by explicitly supervising the candidate generation procedure through mixing one-to-one supervision and one-to-many supervision. Our approach, namely MS-DETR, is simple, and places one-to-many supervision to the object queries of the primary decoder that is used for inference. In comparison to existing DETR variants with one-to-many supervision, such as Group DETR and Hybrid DETR, our approach does not need additional decoder branches or object queries. The object queries of the primary decoder in our approach directly benefit from one-to-many supervision and thus are superior in object candidate prediction. Experimental results show that our approach outperforms related DETR variants, such as DN-DETR, Hybrid DETR, and Group DETR, and the combination with related DETR variants further improves the performance.

preprint2022arXiv

Accelerating Frank-Wolfe via Averaging Step Directions

The Frank-Wolfe method is a popular method in sparse constrained optimization, due to its fast per-iteration complexity. However, the tradeoff is that its worst case global convergence is comparatively slow, and importantly, is fundamentally slower than its flow rate--that is to say, the convergence rate is throttled by discretization error. In this work, we consider a modified Frank-Wolfe where the step direction is a simple weighted average of past oracle calls. This method requires very little memory and computational overhead, and provably decays this discretization error term. Numerically, we show that this method improves the convergence rate over several problems, especially after the sparse manifold has been detected. Theoretically, we show the method has an overall global convergence rate of $O(1/k^p)$, where $0< p < 1$; after manifold identification, this rate speeds to $O(1/k^{3p/2})$. We also observe that the method achieves this accelerated rate from a very early stage, suggesting a promising mode of acceleration for this family of methods.

preprint2022arXiv

Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets

In this paper, we consider approximate Frank-Wolfe (FW) algorithms to solve convex optimization problems over graph-structured support sets where the linear minimization oracle (LMO) cannot be efficiently obtained in general. We first demonstrate that two popular approximation assumptions (additive and multiplicative gap errors) are not applicable in that no cheap gap-approximate LMO oracle exists. Thus, approximate dual maximization oracles (DMO) are proposed, which approximate the inner product rather than the gap. We prove that the standard FW method using a $δ$-approximate DMO converges as $\mathcal{O}((1-δ) \sqrt{s}/δ)$ in the worst case, and as $\mathcal{O}(L/(δ^2 t))$ over a $δ$-relaxation of the constraint set. Furthermore, when the solution is on the boundary, a variant of FW converges as $\mathcal{O}(1/t^2)$ under the quadratic growth assumption. Our empirical results suggest that even these improved bounds are pessimistic, showing fast convergence in recovering real-world images with graph-structured sparsity.

preprint2022arXiv

Bridging the Source-to-target Gap for Cross-domain Person Re-Identification with Intermediate Domains

Cross-domain person re-identification (re-ID), such as unsupervised domain adaptive (UDA) re-ID, aims to transfer the identity-discriminative knowledge from the source to the target domain. Existing methods commonly consider the source and target domains are isolated from each other, i.e., no intermediate status is modeled between both domains. Directly transferring the knowledge between two isolated domains can be very difficult, especially when the domain gap is large. From a novel perspective, we assume these two domains are not completely isolated, but can be connected through intermediate domains. Instead of directly aligning the source and target domains against each other, we propose to align the source and target domains against their intermediate domains for a smooth knowledge transfer. To discover and utilize these intermediate domains, we propose an Intermediate Domain Module (IDM) and a Mirrors Generation Module (MGM). IDM has two functions: 1) it generates multiple intermediate domains by mixing the hidden-layer features from source and target domains and 2) it dynamically reduces the domain gap between the source / target domain features and the intermediate domain features. While IDM achieves good domain alignment, it introduces a side effect, i.e., the mix-up operation may mix the identities into a new identity and lose the original identities. To compensate this, MGM is introduced by mapping the features into the IDM-generated intermediate domains without changing their original identity. It allows to focus on minimizing domain variations to promote the alignment between the source / target domain and intermediate domains, which reinforces IDM into IDM++. We extensively evaluate our method under both the UDA and domain generalization (DG) scenarios and observe that IDM++ yields consistent performance improvement for cross-domain re-ID, achieving new state of the art.

preprint2022arXiv

Continuous Time Frank-Wolfe Does Not Zig-Zag, But Multistep Methods Do Not Accelerate

The Frank-Wolfe algorithm has regained much interest in its use in structurally constrained machine learning applications. However, one major limitation of the Frank-Wolfe algorithm is the slow local convergence property due to the zig-zagging behavior. We observe that this zig-zagging phenomenon can be viewed as an artifact of discretization, as when the method is viewed as an Euler discretization of a continuous time flow, that flow does not zig-zag. For this reason, we propose multistep Frank-Wolfe variants based on discretizations of the same flow whose truncation errors decay as $O(Δ^p)$, where $p$ is the method's order. This strategy "stabilizes" the method, and allows tools like line search and momentum to have more benefit. However, in terms of a convergence rate, our result is ultimately negative, suggesting that no Runge-Kutta-type discretization scheme can achieve a better convergence rate than the vanilla Frank-Wolfe method. We believe that this analysis adds to the growing knowledge of flow analysis for optimization methods, and is a cautionary tale on the ultimate usefulness of multistep methods.

preprint2022arXiv

Embodied Self-supervised Learning by Coordinated Sampling and Training

Self-supervised learning can significantly improve the performance of downstream tasks, however, the dimensions of learned representations normally lack explicit physical meanings. In this work, we propose a novel self-supervised approach to solve inverse problems by employing the corresponding physical forward process so that the learned representations can have explicit physical meanings. The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to a physical process to generate data in the observational space; At the training step, the same network is optimized with the sampled paired data. We prove the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem to infer articulatory information from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method can converge steadily and the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models can generalize well to unseen speakers or even new languages, and performance can be further improved through self-adaptation.

preprint2022arXiv

Multimode soliton collisions in graded-index optical fibers

In this work, we unveil the unique complex dynamics of multimode soliton interactions in graded-index optical fibers through simulations and experiments. By generating two multimode solitons from the fission of an input femtosecond pulse, we examine the evolution of their Raman-induced red-shift when the input pulse energy grows larger. Remarkably, we find that the output red-shift of the trailing multimode soliton may be reduced, so that it accelerates until it collides with the leading multimode soliton. As a result of the inelastic collision, a significant energy transfer occurs between the two multimode solitons: the trailing soliton captures energy from the leading soliton, which ultimately enhances its red-shift, thus increasing temporal separation between the two multimode solitons.

preprint2022arXiv

Results and findings of the 2021 Image Similarity Challenge

The 2021 Image Similarity Challenge introduced a dataset to serve as a new benchmark to evaluate recent image copy detection methods. There were 200 participants to the competition. This paper presents a quantitative and qualitative analysis of the top submissions. It appears that the most difficult image transformations involve either severe image crops or hiding into unrelated images, combined with local pixel perturbations. The key algorithmic elements in the winning submissions are: training on strong augmentations, self-supervised learning, score normalization, explicit overlay detection, and global descriptor matching followed by pairwise image comparison.

preprint2022arXiv

UFO: Unified Feature Optimization

This paper proposes a novel Unified Feature Optimization (UFO) paradigm for training and deploying deep models under real-world and large-scale scenarios, which requires a collection of multiple AI functions. UFO aims to benefit each single task with a large-scale pretraining on all tasks. Compared with the well known foundation model, UFO has two different points of emphasis, i.e., relatively smaller model size and NO adaptation cost: 1) UFO squeezes a wide range of tasks into a moderate-sized unified model in a multi-task learning manner and further trims the model size when transferred to down-stream tasks. 2) UFO does not emphasize transfer to novel tasks. Instead, it aims to make the trimmed model dedicated for one or more already-seen task. With these two characteristics, UFO provides great convenience for flexible deployment, while maintaining the benefits of large-scale pretraining. A key merit of UFO is that the trimming process not only reduces the model size and inference consumption, but also even improves the accuracy on certain tasks. Specifically, UFO considers the multi-task training and brings two-fold impact on the unified model: some closely related tasks have mutual benefits, while some tasks have conflicts against each other. UFO manages to reduce the conflicts and to preserve the mutual benefits through a novel Network Architecture Search (NAS) method. Experiments on a wide range of deep representation learning tasks (i.e., face recognition, person re-identification, vehicle re-identification and product retrieval) show that the model trimmed from UFO achieves higher accuracy than its single-task-trained counterpart and yet has smaller model size, validating the concept of UFO. Besides, UFO also supported the release of 17 billion parameters computer vision (CV) foundation model which is the largest CV model in the industry.

preprint2022arXiv

V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval

Product retrieval is of great importance in the ecommerce domain. This paper introduces our 1st-place solution in eBay eProduct Visual Search Challenge (FGVC9), which is featured for an ensemble of about 20 models from vision models and vision-language models. While model ensemble is common, we show that combining the vision models and vision-language models brings particular benefits from their complementarity and is a key factor to our superiority. Specifically, for the vision models, we use a two-stage training pipeline which first learns from the coarse labels provided in the training set and then conducts fine-grained self-supervised training, yielding a coarse-to-fine metric learning manner. For the vision-language models, we use the textual description of the training image as the supervision signals for fine-tuning the image-encoder (feature extractor). With these designs, our solution achieves 0.7623 MAR@10, ranking the first place among all the competitors. The code is available at: \href{https://github.com/WangWenhao0716/V2L}{V$^2$L}.

preprint2022arXiv

Visualization Design Practices in a Crisis: Behind the Scenes with COVID-19 Dashboard Creators

During the COVID-19 pandemic, a number of data visualizations were created to inform the public about the rapidly evolving crisis. Data dashboards, a form of information dissemination used during the pandemic, have facilitated this process by visualizing statistics regarding the number of COVID-19 cases over time. In this research, we conducted a qualitative interview study among dashboard creators from federal agencies, state health departments, mainstream news media outlets, and other organizations that created (often widely-used) COVID-19 dashboards to answer the following questions: how did visualization creators engage in COVID-19 dashboard design, and what tensions, conflicts, and challenges arose during this process? Our findings detail the trajectory of design practices -- from creation to expansion, maintenance, and termination -- that are shaped by the complex interplay between design goals, tools and technologies, labor, emerging crisis contexts, and public engagement. We particularly examined the tensions between designers and the general public involved in these processes. These conflicts, which often materialized due to a divergence between public demands and standing policies, centered around the type and amount of information to be visualized, how public perceptions shape and are shaped by visualization design, and the strategies utilized to deal with (potential) misinterpretations and misuse of visualizations. Our findings and lessons learned shed light on new ways of thinking in visualization design, focusing on the bundled activities that are invariably involved in human and nonhuman participation throughout the entire trajectory of design practice.

preprint2021arXiv

High-performance parallel classical scheme for simulating shallow quantum circuits

Recently, constant-depth quantum circuits are proved more powerful than their classical counterparts at solving certain problems, e.g., the two-dimensional (2D) hidden linear function (HLF) problem regarding a symmetric binary matrix. To further investigate the boundary between classical and quantum computing models, in this work we propose a high-performance two-stage classical scheme to solve a full-sampling variant of the 2D HLF problem, which combines traditional classical parallel algorithms and a gate-based classical circuit model together for exactly simulating the target shallow quantum circuits. Under reasonable parameter assumptions, a theoretical analysis reveals our classical simulator consumes less runtime than that of near-term quantum processors for most problem instances. Furthermore, we demonstrate the typical all-connected 2D grid instances by moderate FPGA circuits, and show our designed parallel scheme is a practically scalable, high-efficient and operationally convenient tool for simulating and verifying graph-state circuits performed by current quantum hardware.

preprint2021arXiv

Mapping the Landscape of COVID-19 Crisis Visualizations

In response to COVID-19, a vast number of visualizations have been created to communicate information to the public. Information exposure in a public health crisis can impact people's attitudes towards and responses to the crisis and risks, and ultimately the trajectory of a pandemic. As such, there is a need for work that documents, organizes, and investigates what COVID-19 visualizations have been presented to the public. We address this gap through an analysis of 668 COVID-19 visualizations. We present our findings through a conceptual framework derived from our analysis, that examines who, (uses) what data, (to communicate) what messages, in what form, under what circumstances in the context of COVID-19 crisis visualizations. We provide a set of factors to be considered within each component of the framework. We conclude with directions for future crisis visualization research.

preprint2020arXiv

Approximate methods for phase retrieval via gauge duality

We consider the problem of finding a low rank symmetric matrix satisfying a system of linear equations, as appears in phase retrieval. In particular, we solve the gauge dual formulation, but use a fast approximation of the spectral computations to achieve a noisy solution estimate. This estimate is then used as the initialization of an alternating gradient descent scheme over a nonconvex rank-1 matrix factorization formulation. Numerical results on small problems show consistent recovery, with very low computational cost.

preprint2020arXiv

Circle Loss: A Unified Perspective of Pair Similarity Optimization

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity $s_p$ and minimize the between-class similarity $s_n$. We find a majority of loss functions, including the triplet loss and the softmax plus cross-entropy loss, embed $s_n$ and $s_p$ into similarity pairs and seek to reduce $(s_n-s_p)$. Such an optimization manner is inflexible, because the penalty strength on every single similarity score is restricted to be equal. Our intuition is that if a similarity score deviates far from the optimum, it should be emphasized. To this end, we simply re-weight each similarity to highlight the less-optimized similarity scores. It results in a Circle loss, which is named due to its circular decision boundary. The Circle loss has a unified formula for two elemental deep feature learning approaches, i.e. learning with class-level labels and pair-wise labels. Analytically, we show that the Circle loss offers a more flexible optimization approach towards a more definite convergence target, compared with the loss functions optimizing $(s_n-s_p)$. Experimentally, we demonstrate the superiority of the Circle loss on a variety of deep feature learning tasks. On face recognition, person re-identification, as well as several fine-grained image retrieval datasets, the achieved performance is on par with the state of the art.

preprint2020arXiv

CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions

This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem, where existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering. A potential drawback of using pseudo labels is that errors may accumulate and it is challenging to estimate the number of pseudo IDs. We introduce a different unsupervised method that allows us to learn pedestrian embeddings from raw videos, without resorting to pseudo labels. The goal is to construct a self-supervised pretext task that matches the person re-ID objective. Inspired by the \emph{data association} concept in multi-object tracking, we propose the \textbf{Cyc}le \textbf{As}sociation (\textbf{CycAs}) task: after performing data association between a pair of video frames forward and then backward, a pedestrian instance is supposed to be associated to itself. To fulfill this goal, the model must learn a meaningful representation that can well describe correspondences between instances in frame pairs. We adapt the discrete association process to a differentiable form, such that end-to-end training becomes feasible. Experiments are conducted in two aspects: We first compare our method with existing unsupervised re-ID methods on seven benchmarks and demonstrate CycAs' superiority. Then, to further validate the practical value of CycAs in real-world applications, we perform training on self-collected videos and report promising performance on standard test sets.

preprint2020arXiv

Deep Representation Learning on Long-tailed Data: A Learnable Embedding Augmentation Perspective

This paper considers learning deep features from long-tailed data. We observe that in the deep feature space, the head classes and the tail classes present different distribution patterns. The head classes have a relatively large spatial span, while the tail classes have significantly small spatial span, due to the lack of intra-class diversity. This uneven distribution between head and tail classes distorts the overall feature space, which compromises the discriminative ability of the learned features. Intuitively, we seek to expand the distribution of the tail classes by transferring from the head classes, so as to alleviate the distortion of the feature space. To this end, we propose to construct each feature into a "feature cloud". If a sample belongs to a tail class, the corresponding feature cloud will have relatively large distribution range, in compensation to its lack of diversity. It allows each tail sample to push the samples from other classes far away, recovering the intra-class diversity of tail classes. Extensive experimental evaluations on person re-identification and face recognition tasks confirm the effectiveness of our method.

preprint2020arXiv

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1\%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6$\times$ and 3$\times$ better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks.

preprint2020arXiv

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across multiple GPUs. Moreover, a lack of hardware support for coherency exacerbates the problem because a programmer must either replicate the data across GPUs or fetch the remote data using high-overhead off-chip links. To address these problems, we propose a multi-GPU system with truly shared memory (MGPU-TSM), where the main memory is physically shared across all the GPUs. We eliminate remote accesses and avoid data replication using an MGPU-TSM system, which simplifies the memory hierarchy. Our preliminary analysis shows that MGPU-TSM with 4 GPUs performs, on average, 3.9x? better than the current best performing multi-GPU configuration for standard application benchmarks.

preprint2020arXiv

Mode-Locking of the Hermite-Gaussian Modes of a Nanolaser

Mode-locking is predicted in a nanolaser cavity forming an effective photonic harmonic potential. The cavity is substantially more compact than a Fabry-Perot resonator with comparable pulsing period, which is here controlled by the potential. In the limit of instantaneous gain and absorption saturation, mode-locking corresponds to a stable dissipative soliton, which it very well approximated by the coherent state of a quantum mechanical harmonic oscillator. This property is robust against non-instantaneous material response and non-zero phase-intensity coupling.

preprint2020arXiv

Robust Initial Alignment for SINS/DVL Based on Reconstructed Observation Vectors

Misalignment angle will result in a considerable error for the integration of Doppler Velocity Log (DVL) and of Strapdown Inertial Navigation System (SINS). In this paper, a robust initial alignment method for SINS/DVL is proposed to solve a practical applicable issue, which is that the outputs of DVL are often corrupted by the outliers. Firstly, the alignment principle for SINS/DVL is summarized. Secondly, based on the principle of this alignment method, the apparent velocity model is investigated, and the parameters expression of the apparent velocity model are derived detailed. Using the apparent velocity model, the unknown parameters of the apparent velocity model are estimated by the developed Robust Kalman Filter (RKF), then the reconstructed observation vector, where the outliers are detected and isolated, is reconstructed by the estimated parameters. Based on the reconstructed observation vectors, the initial attitude is determined. Finally, the simulation and field tests are carried out to verify the performance of the proposed method. The test results are shown that the proposed method can detect and isolate the outliers effectively and get better performance than the previous work.

preprint2020arXiv

Safe Screening for the Generalized Conditional Gradient Method

The conditional gradient method (CGM) has been widely used for fast sparse approximation, having a low per iteration computational cost for structured sparse regularizers. We explore the sparsity acquiring properties of a generalized CGM (gCGM), where the constraint is replaced by a penalty function based on a gauge penalty; this can be done without significantly increasing the per-iteration computation, and applies to general notions of sparsity. Without assuming bounded iterates, we show $O(1/t)$ convergence of the function values and gap of gCGM. We couple this with a safe screening rule, and show that at a rate $O(1/(tδ^2))$, the screened support matches the support at the solution, where $δ\geq 0$ measures how close the problem is to being degenerate. In our experiments, we show that the gCGM for these modified penalties have similar feature selection properties as common penalties, but with potentially more stability over the choice of hyperparameter.

preprint2020arXiv

Summarizing CPU and GPU Design Trends with Product Data

Moore's Law and Dennard Scaling have guided the semiconductor industry for the past few decades. Recently, both laws have faced validity challenges as transistor sizes approach the practical limits of physics. We are interested in testing the validity of these laws and reflect on the reasons responsible. In this work, we collect data of more than 4000 publicly-available CPU and GPU products. We find that transistor scaling remains critical in keeping the laws valid. However, architectural solutions have become increasingly important and will play a larger role in the future. We observe that GPUs consistently deliver higher performance than CPUs. GPU performance continues to rise because of increases in GPU frequency, improvements in the thermal design power (TDP), and growth in die size. But we also see the ratio of GPU to CPU performance moving closer to parity, thanks to new SIMD extensions on CPUs and increased CPU core counts.

preprint2015arXiv

Bell's measure and implementing quantum Fourier transform with orbital angular momentum of classical light

We perform Bell's measurement and perform quantum Fourier transform with the classical vortex beam. The violation of Bell's inequality for such a non-separable classical correlation has been demonstrated experimentally. Based on the classical vortex beam and nonquantum entanglement between the polarization and orbital angular momentum, the Hadamard gates and conditional phase gates have been designed. Furthermore, a quantum Fourier transform has been implemented experimentally, which is the crucial final step in Shor's algorithm

preprint2013arXiv

Decomposition in conic optimization with partially separable structure

Decomposition techniques for linear programming are difficult to extend to conic optimization problems with general non-polyhedral convex cones because the conic inequalities introduce an additional nonlinear coupling between the variables. However in many applications the convex cones have a partially separable structure that allows them to be characterized in terms of simpler lower-dimensional cones. The most important example is sparse semidefinite programming with a chordal sparsity pattern. Here partial separability derives from the clique decomposition theorems that characterize positive semidefinite and positive-semidefinite-completable matrices with chordal sparsity patterns. The paper describes a decomposition method that exploits partial separability in conic linear optimization. The method is based on Spingarn's method for equality constrained convex optimization, combined with a fast interior-point method for evaluating proximal operators.

preprint2012arXiv

Following states in temperature in the spherical s+p-spin glass model

In many mean-field glassy systems, the low-temperature Gibbs measure is dominated by exponentially many metastable states. We analyze the evolution of the metastable states as temperature changes adiabatically in the solvable case of the spherical $s+p$-spin glass model, extending the work of Barrat, Franz and Parisi J. Phys. A 30, 5593 (1997). We confirm the presence of level crossings, bifurcations, and temperature chaos. For the states that are at equilibrium close to the so-called dynamical temperature $T_d$, we find, however, that the following state method (and the dynamical solution of the model as well) is intrinsically limited by the vanishing of solutions with non-zero overlap at low temperature.

preprint2012arXiv

Probabilistic Reconstruction in Compressed Sensing: Algorithms, Phase Diagrams, and Threshold Achieving Matrices

Compressed sensing is a signal processing method that acquires data directly in a compressed form. This allows one to make less measurements than what was considered necessary to record a signal, enabling faster or more precise measurement protocols in a wide range of applications. Using an interdisciplinary approach, we have recently proposed in [arXiv:1109.4424] a strategy that allows compressed sensing to be performed at acquisition rates approaching to the theoretical optimal limits. In this paper, we give a more thorough presentation of our approach, and introduce many new results. We present the probabilistic approach to reconstruction and discuss its optimality and robustness. We detail the derivation of the message passing algorithm for reconstruction and expectation max- imization learning of signal-model parameters. We further develop the asymptotic analysis of the corresponding phase diagrams with and without measurement noise, for different distribution of signals, and discuss the best possible reconstruction performances regardless of the algorithm. We also present new efficient seeding matrices, test them on synthetic data and analyze their performance asymptotically.

preprint2012arXiv

Statistical physics-based reconstruction in compressed sensing

Compressed sensing is triggering a major evolution in signal acquisition. It consists in sampling a sparse signal at low rate and later using computational power for its exact reconstruction, so that only the necessary information is measured. Currently used reconstruction techniques are, however, limited to acquisition rates larger than the true density of the signal. We design a new procedure which is able to reconstruct exactly the signal with a number of measurements that approaches the theoretical limit in the limit of large systems. It is based on the joint use of three essential ingredients: a probabilistic approach to signal reconstruction, a message-passing algorithm adapted from belief propagation, and a careful design of the measurement matrix inspired from the theory of crystal nucleation. The performance of this new algorithm is analyzed by statistical physics methods. The obtained improvement is confirmed by numerical studies of several cases.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxivconfidence 95%

external id: arxiv:2605.07384:author:2:yifan-sun

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.18747:author:21:yifan-sun

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2604.27833:author:5:yifan-sun

Imported May 20, 2026Synced May 20, 2026

arxivconfidence 95%

external id: arxiv:2605.03971:author:5:yifan-sun

Imported May 20, 2026Synced May 20, 2026

4 works

Yi Yang

Researcher

Yi Yang contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

David Kaeli

Researcher

David Kaeli contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Florent Krzakala

Researcher

Florent Krzakala contributes to research discovery and scholarly infrastructure.

Open to collaborate

3 works

Lenka Zdeborová

Researcher

Lenka Zdeborová contributes to research discovery and scholarly infrastructure.

Open to collaborate

Yifan Sun

What is connected

Connect this record

See the researcher in context

Building this map preview

34 published item(s)

Code as Agent Harness

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

SPARK: Safe Protective and Assistive Robot Kit

StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

MS-DETR: Efficient DETR Training with Mixed Supervision

Accelerating Frank-Wolfe via Averaging Step Directions

Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets

Bridging the Source-to-target Gap for Cross-domain Person Re-Identification with Intermediate Domains

Continuous Time Frank-Wolfe Does Not Zig-Zag, But Multistep Methods Do Not Accelerate

Embodied Self-supervised Learning by Coordinated Sampling and Training

Multimode soliton collisions in graded-index optical fibers

Results and findings of the 2021 Image Similarity Challenge

UFO: Unified Feature Optimization

V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval

Visualization Design Practices in a Crisis: Behind the Scenes with COVID-19 Dashboard Creators

High-performance parallel classical scheme for simulating shallow quantum circuits

Mapping the Landscape of COVID-19 Crisis Visualizations

Approximate methods for phase retrieval via gauge duality

Circle Loss: A Unified Perspective of Pair Similarity Optimization

CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions

Deep Representation Learning on Long-tailed Data: A Learnable Embedding Augmentation Perspective

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

Mode-Locking of the Hermite-Gaussian Modes of a Nanolaser

Robust Initial Alignment for SINS/DVL Based on Reconstructed Observation Vectors

Safe Screening for the Generalized Conditional Gradient Method

Summarizing CPU and GPU Design Trends with Product Data

Bell's measure and implementing quantum Fourier transform with orbital angular momentum of classical light

Decomposition in conic optimization with partially separable structure

Following states in temperature in the spherical s+p-spin glass model

Probabilistic Reconstruction in Compressed Sensing: Algorithms, Phase Diagrams, and Threshold Achieving Matrices

Statistical physics-based reconstruction in compressed sensing