Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
19 works
0 followers
14 topics
4 close collaborators

Published work

19 published items

preprint, 2026, arXiv

AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits

Analog and mixed-signal (AMS) circuit design remains heavily reliant on expert knowledge. While recent AI-driven automation tools can generate candidate topologies, they critically depend on manually curated datasets with functional and performance annotations -- a requirement that current large language models (LLMs) and vision models cannot automate. Existing approaches still require domain experts to manually interpret circuit functionality. We present AMSnet-q, a fully automated, unsupervised pipeline that eliminates human-in-the-loop annotation by converting schematic images directly into a labeled AMS circuit database. Unlike prior work that stops at netlist extraction, our framework automates the complete verification loop: it performs schematic-to-netlist conversion, topology-aware testbench generation, and simulation-based sizing validation to objectively determine circuit functionality. Validated in 28 nm technology, AMSnet-q processed 739 schematics from the AMSnet 1.0 dataset, automatically constructing a repository of 4 circuit classes, 105 distinct topologies, and 89,789 labeled device configurations. By decoupling human effort from dataset volume and reducing the workload to a one-time testbench template per circuit class, AMSnet-q enables scalable, objective, and fully automated AMS database construction.

preprint, 2026, arXiv

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase -- accounting for 50-80% of total step time -- is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints in asynchronous training to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently -- simultaneously achieving full bubble elimination without compromising algorithmic constraints. Experimental results demonstrate that our DORA system achieves substantial improvements in throughput -- up to 2-3 times higher than state-of-the-art systems on open-source benchmarks -- without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2-4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capability of most advanced LLMs.

preprint, 2026, arXiv

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

preprint, 2024, arXiv

K-theoretic classification of inductive limit actions of fusion categories on AF-algebras

We introduce a K-theoretic invariant for actions of unitary fusion categories on unital C*-algebras. We show that for inductive limits of finite dimensional actions of fusion categories on unital AF-algebras, this is a complete invariant. In particular, this gives a complete invariant for inductive limit actions of finite groups on AF-algebras. We apply our results to obtain a classification of finite depth, strongly AF-inclusions of unital AF-algebras.

preprint, 2024, arXiv

Q-system completion for C* 2-categories

A Q-system in a C* 2-category is a unitary version of a separable Frobenius algebra object and can be viewed as a unitary version of a higher idempotent. We define a higher unitary idempotent completion for C* 2-categories called Q-system completion and study its properties. We show that the C* 2-category of right correspondences of unital C*-algebras is Q-system complete by constructing an inverse realization $\dagger$ 2-functor. We use this result to construct induced actions of group theoretical unitary fusion categories on continuous trace C*-algebras with connected spectra.

preprint, 2024, arXiv

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computation complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also balances retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
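
The two-stage retrieval described in this abstract (coarse recall of top-k candidates, then fine-grained reranking) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of cosine similarity, and the fine-grained score (maximum similarity over per-frame vectors) are hypothetical and are not the paper's TIB or Pearson Constraint.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(text_vec, coarse_vecs, fine_vecs, k):
    """Return video indices: coarse top-k recall, then fine-grained rerank.

    coarse_vecs[i] is one vector per video (hypothetical coarse representation).
    fine_vecs[i] is a list of per-frame vectors (hypothetical fine representation).
    """
    # Stage 1: one cheap similarity per video, keep the k best candidates.
    coarse_scores = [cosine(text_vec, v) for v in coarse_vecs]
    candidates = sorted(
        range(len(coarse_vecs)), key=lambda i: coarse_scores[i], reverse=True
    )[:k]

    # Stage 2: a costlier score, computed only for the k candidates.
    # Assumption: score a video by its best-matching frame.
    def fine_score(i):
        return max(cosine(text_vec, frame) for frame in fine_vecs[i])

    return sorted(candidates, key=fine_score, reverse=True)


text = [1.0, 0.0]
coarse = [[1.0, 0.2], [0.9, 0.5], [0.0, 1.0]]
fine = [
    [[0.5, 0.5], [0.6, 0.4]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0]],
]
ranking = two_stage_retrieve(text, coarse, fine, k=2)
print(ranking)
```

In this toy input, video 2 is dropped in the coarse stage and never reaches the fine stage, which is where the efficiency comes from: the expensive score is computed for k videos rather than for the whole collection.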

preprint, 2023, arXiv

Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

In-memory key-value stores (IMKVSes) serve many online applications because of their efficiency. To support data backup, popular industrial IMKVSes periodically take a point-in-time snapshot of the in-memory data with the system call fork. However, this mechanism can result in latency spikes for queries arriving during the snapshot period because fork leads the engine into the kernel mode in which the engine is out-of-service for queries. In contrast to existing research focusing on optimizing snapshot algorithms, we optimize the fork operation to address the latency spike problem from the operating system (OS) level, while keeping the data persistence mechanism in IMKVSes unchanged. Specifically, we first conduct an in-depth study to reveal the impact of the fork operation as well as the optimization techniques on query latency. Based on findings in the study, we propose Async-fork to offload the work of copying the page table from the engine (the parent process) to the child process, as copying the page table dominates the execution time of fork. To keep data consistent between the parent and the child, we design the proactive synchronization strategy. Async-fork is implemented in the Linux kernel and deployed into the online Redis database in public clouds. Our experimental results show that compared with the default fork method in OS, Async-fork reduces the tail latency of queries arriving during the snapshot period by 81.76% on an 8GB instance and 99.84% on a 64GB instance.

preprint, 2023, arXiv

High-resolution myelin-water fraction and quantitative relaxation mapping using 3D ViSTa-MR fingerprinting

Purpose: This study aims to develop a high-resolution whole-brain multi-parametric quantitative MRI approach for simultaneous mapping of myelin-water fraction (MWF), T1, T2, and proton-density (PD), all within a clinically feasible scan time. Methods: We developed 3D ViSTa-MRF, which combines the Visualization of Short Transverse relaxation time component (ViSTa) technique with MR Fingerprinting (MRF), to achieve high-fidelity whole-brain MWF and T1/T2/PD mapping on a clinical 3T scanner. To achieve fast acquisition and memory-efficient reconstruction, the ViSTa-MRF sequence leverages an optimized 3D tiny-golden-angle-shuffling spiral-projection acquisition and joint spatial-temporal subspace reconstruction with an optimized preconditioning algorithm. With the proposed ViSTa-MRF approach, high-fidelity direct MWF mapping was achieved without a need for multi-compartment fitting that could introduce bias and/or noise from additional assumptions or priors. Results: The in-vivo results demonstrate the effectiveness of the proposed acquisition and reconstruction framework to provide fast multi-parametric mapping with high SNR and good quality. The in-vivo results of 1mm- and 0.66mm-iso datasets indicate that the MWF values measured by the proposed method are consistent with standard ViSTa results that are 30x slower with lower SNR. Furthermore, we applied the proposed method to enable 5-minute whole-brain 1mm-iso assessment of MWF and T1/T2/PD mappings for infant brain development and for post-mortem brain samples. Conclusions: In this work, we have developed a 3D ViSTa-MRF technique that enables the acquisition of whole-brain MWF, quantitative T1, T2, and PD maps at 1mm and 0.66mm isotropic resolution in 5 and 15 minutes, respectively. This advancement allows for quantitative investigations of myelination changes in the brain.

preprint, 2022, arXiv

A Space-Time Neural Network for Analysis of Stress Evolution under DC Current Stressing

The electromigration (EM)-induced reliability issues in very large scale integration (VLSI) circuits have attracted increased attention due to the continuous technology scaling. Traditional EM models often lead to overly pessimistic predictions incompatible with the shrinking design margin in future technology nodes. Motivated by the latest success of neural networks in solving differential equations in physical problems, we propose a novel mesh-free model to compute EM-induced stress evolution in VLSI circuits. The model utilizes a specifically crafted space-time physics-informed neural network (STPINN) as the solver for EM analysis. By coupling the physics-based EM analysis with dynamic temperature incorporating Joule heating and via effect, we can observe stress evolution along multi-segment interconnect trees under constant, time-dependent and space-time-dependent temperature during the void nucleation phase. The proposed STPINN method obviates the time discretization and meshing required in conventional numerical stress evolution analysis and offers significant computational savings. Numerical comparison with competing schemes demonstrates a 2x to 52x speedup with satisfactory accuracy.

preprint, 2022, arXiv

DCCF: Deep Comprehensible Color Filter Learning Framework for High-Resolution Image Harmonization

Image color harmonization algorithms aim to automatically match the color distribution of foreground and background images captured in different conditions. Previous deep learning based models neglect two issues that are critical for practical applications, namely high resolution (HR) image processing and model comprehensibility. In this paper, we propose a novel Deep Comprehensible Color Filter (DCCF) learning framework for high-resolution image harmonization. Specifically, DCCF first downsamples the original input image to its low-resolution (LR) counterpart, then learns four human comprehensible neural filters (i.e., hue, saturation, value and attentive rendering filters) in an end-to-end manner, and finally applies these filters to the original input image to get the harmonized result. Benefiting from the comprehensible neural filters, we can provide a simple yet efficient handler for users to cooperate with the deep model to get the desired results with very little effort when necessary. Extensive experiments demonstrate the effectiveness of the DCCF learning framework, and it outperforms the state-of-the-art post-processing method on the iHarmony4 dataset at full image resolution, achieving 7.63% and 1.69% relative improvements on MSE and PSNR respectively.

preprint, 2022, arXiv

Multilayer Perceptron Based Stress Evolution Analysis under DC Current Stressing for Multi-segment Wires

Electromigration (EM) is one of the major concerns in the reliability analysis of very large scale integration (VLSI) systems due to the continuous technology scaling. Accurately predicting the time-to-failure of integrated circuits (ICs) becomes increasingly important for modern IC design. However, traditional methods are often not sufficiently accurate, leading to undesirable over-design, especially in advanced technology nodes. In this paper, we propose an approach using multilayer perceptrons (MLP) to compute stress evolution in the interconnect trees during the void nucleation phase. The availability of a customized trial function for neural network training holds the promise of finding dynamic mesh-free stress evolution on complex interconnect trees under time-varying temperatures. Specifically, we formulate a new objective function considering the EM-induced coupled partial differential equations (PDEs), boundary conditions (BCs), and initial conditions to enforce the physics-based constraints in the spatial-temporal domain. The proposed model avoids meshing and reduces temporal iterations compared with conventional numerical approaches like FEM. Numerical results confirm its advantages in accuracy and computational performance.

preprint, 2022, arXiv

SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences

The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention with respect to the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.
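
To make "hybrid sparse attention" concrete, the sketch below builds a Longformer-style mask that combines a sliding window with a few global tokens, and counts how many query-key pairs remain compared with the full quadratic pattern. The function names and parameters are hypothetical and chosen for illustration; this is not SALO's data scheduler or its hardware mapping.

```python
def hybrid_sparse_mask(seq_len, window, global_tokens):
    """Boolean mask where mask[i][j] is True if query i may attend to key j.

    A pair is kept if it lies inside the sliding window (|i - j| <= window)
    or if either position is a global token.
    """
    global_set = set(global_tokens)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            in_window = abs(i - j) <= window
            touches_global = i in global_set or j in global_set
            mask[i][j] = in_window or touches_global
    return mask


def kept_pairs(mask):
    """Number of query-key pairs the mask keeps."""
    return sum(sum(row) for row in mask)


mask = hybrid_sparse_mask(seq_len=8, window=1, global_tokens=[0])
total = len(mask) * len(mask)
print(f"kept {kept_pairs(mask)} of {total} pairs")
```

With a fixed window and a fixed number of global tokens, the number of kept pairs grows linearly with sequence length rather than quadratically, which is the property that sparse attention patterns of this kind exploit.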

preprint, 2022, arXiv

The Serverless Computing Survey: A Technical Primer for Design Architecture

The development of cloud infrastructures inspires the emergence of cloud-native computing. As the most promising architecture for deploying microservices, serverless computing has recently attracted more and more attention in both industry and academia. Due to its inherent scalability and flexibility, serverless computing becomes attractive and more pervasive for ever-growing Internet services. Despite the momentum in the cloud-native community, the existing challenges and compromises still wait for more advanced research and solutions to further explore the potentials of the serverless computing model. As a contribution to this knowledge, this article surveys and elaborates the research domains in the serverless context by decoupling the architecture into four stack layers: Virtualization, Encapsule, System Orchestration, and System Coordination. Inspired by the security model, we highlight the key implications and limitations of these works in each layer, and make suggestions for potential challenges to the field of future serverless computing.

preprint, 2022, arXiv

VELTAIR: Towards High-Performance Multi-tenant Deep Learning Services via Adaptive Compilation and Scheduling

Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although the multi-tenant service has been studied in conventional workloads, it has not been deeply studied for deep learning services, especially on general-purpose hardware. In this work, we systematically analyze the opportunities and challenges of providing multi-tenant deep learning services on the general-purpose CPU architecture from the aspects of scheduling granularity and code generation. We propose an adaptive granularity scheduling scheme to both guarantee resource usage efficiency and reduce the scheduling conflict rate. We also propose an adaptive compilation strategy, by which we can dynamically and intelligently pick a program with proper exclusive and shared resource usage to reduce overall interference-induced performance loss. Compared to the existing works, our design can serve more requests under the same QoS target in various scenarios (e.g., +71%, +62%, +45% for light, medium, and heavy workloads, respectively), and reduce the average query latency by 50%.

preprint, 2021, arXiv

FAT: Learning Low-Bitwidth Parametric Representation via Frequency-Aware Transformation

Learning convolutional neural networks (CNNs) with low bitwidth is challenging because performance may drop significantly after quantization. Prior art often discretizes the network weights by carefully tuning hyper-parameters of quantization (e.g., non-uniform stepsize and layer-wise bitwidths), which is complicated and sub-optimal because the full-precision and low-precision models have a large discrepancy. This work presents a novel quantization pipeline, Frequency-Aware Transformation (FAT), which has several appealing benefits. (1) Rather than designing complicated quantizers like existing works, FAT learns to transform network weights in the frequency domain before quantization, making them more amenable to training in low bitwidth. (2) With FAT, CNNs can be easily trained in low precision using simple standard quantizers without tedious hyper-parameter tuning. Theoretical analysis shows that FAT improves both uniform and non-uniform quantizers. (3) FAT can be easily plugged into many CNN architectures. When training ResNet-18 and MobileNet-V2 in 4 bits, FAT plus a simple rounding operation already achieves 70.5% and 69.2% top-1 accuracy on ImageNet without bells and whistles, outperforming recent state-of-the-art by reducing 54.9X and 45.7X computations against full-precision models. We hope FAT provides a novel perspective for model quantization. Code is available at https://github.com/ChaofanTao/FAT_Quantization.
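
For readers unfamiliar with the "simple standard quantizers" the abstract mentions, here is a minimal sketch of a symmetric uniform quantizer based on rounding. It shows only the rounding step: the frequency-domain transform that FAT learns is not reproduced, and the function name, the symmetric scaling rule, and the sample weights are assumptions made for illustration.

```python
def quantize_uniform(weights, bits=4):
    """Round weights onto a symmetric uniform grid with 2**bits - 1 levels.

    The grid spans [-max_abs, +max_abs]; for 4 bits the integer codes lie
    in [-7, 7]. Returns the de-quantized (float) values.
    """
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    scale = max_abs / levels  # step size between neighbouring grid points
    # Python's round() uses round-half-to-even; ties are rare for real weights.
    return [round(w / scale) * scale for w in weights]


weights = [7.0, -3.4, 0.6, 0.0]
quantized = quantize_uniform(weights, bits=4)
print(quantized)
```

Each weight moves by at most half a step, so the quantization error shrinks as the bitwidth grows; at 4 bits only 15 distinct values are available per tensor under this scheme.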

preprint, 2020, arXiv

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

The research interest in specialized hardware accelerators for deep neural networks (DNN) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific "kernels" such as convolution and matrix multiplication, which are vital but only part of an end-to-end DNN-enabled application. Meaningful speedups over the entire application often require supporting computations that are, while massively parallel, ill-suited to DNN accelerators. Integrating a general-purpose processor such as a CPU or a GPU incurs significant data movement overhead and leads to resource under-utilization on the DNN accelerators. We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end applications. The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model. The SMA exploits the common components shared between the systolic-array accelerator and the GPU, and provides lightweight reconfiguration capability to switch between the two modes in-situ. The SMA achieves up to 63% performance improvement while consuming 23% less energy than the baseline Volta architecture with TensorCore.

preprint, 2020, arXiv

Improving Web Content Blocking With Event-Loop-Turn Granularity JavaScript Signatures

Content blocking is an important part of a performant, user-serving, privacy-respecting web. Most content blockers build trust labels over URLs. While useful, this approach has well-understood shortcomings. Attackers may avoid detection by changing URLs or domains, bundling unwanted code with benign code, or inlining code in pages. The common flaw in existing approaches is that they evaluate code based on its delivery mechanism, not its behavior. In this work we address this problem with a system for generating signatures of the privacy-and-security relevant behavior of executed JavaScript. Our system considers script behavior during each turn on the JavaScript event loop. Focusing on event loop turns allows us to build signatures that are robust against code obfuscation, code bundling, URL modification, and other common evasions, as well as handle unique aspects of web applications. This work makes the following contributions to improving content blocking: First, we implement a novel system to build per-event-loop-turn signatures of JavaScript code by instrumenting the Blink and V8 runtimes. Second, we apply these signatures to measure filter list evasion, by using EasyList and EasyPrivacy as ground truth and finding other code that behaves identically. We build ~2m signatures of privacy-and-security behaviors from 11,212 unique scripts blocked by filter lists, and find 3,589 more unique scripts including the same harmful code, affecting 12.48% of websites measured. Third, we taxonomize common filter list evasion techniques. Finally, we present defenses: filter list additions where possible, and a proposed, signature-based system in other cases. We share the implementation of our signature-generation system, the dataset from applying our system to the Alexa 100K, and 586 AdBlock Plus compatible filter list rules to block instances of currently blocked code being moved to new URLs.

preprint, 2020, arXiv

Topology Virtualization and Dynamics Shielding Method for LEO Satellite Networks

The Virtual Node (VN) method is widely adopted to handle satellite network topological dynamics. However, the conventional VN method is insufficient when earth rotation and inter-plane phase difference are considered. An improved VN method based on Celestial Sphere Division is proposed to overcome the defects of the conventional method. An optimized inter-satellite link connecting mode is derived to achieve maximal available links. The optimal VN division solution and addressing scheme are designed to generate a nearly static virtual network and solve the asynchronous switches caused by inter-plane phase difference. Comparison results demonstrate the advantages of the proposed method.

preprint, 2020, arXiv

Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters

While prior research focuses on CPU-based microservices, it is not applicable to GPU-based microservices due to the different contention patterns. It is challenging to optimize the resource utilization while guaranteeing the QoS for GPU microservices. We find that the overhead is caused by inter-microservice communication, GPU resource contention and imbalanced throughput within the microservice pipeline. We propose Camelot, a runtime system that manages GPU microservices considering the above factors. In Camelot, a global memory-based communication mechanism enables onsite data sharing that significantly reduces the end-to-end latencies of user queries. We also propose two contention-aware resource allocation policies that either maximize the peak supported service load or minimize the resource usage at low load while ensuring the required QoS. The two policies consider the microservice pipeline effect and the runtime GPU resource contention when allocating resources for the microservices. Compared with state-of-the-art work, Camelot increases the supported peak load by up to 64.5% with limited GPUs, and reduces resource usage by 35% at low load while achieving the desired 99%-ile latency target.