Source author record

Jingwei Xu

Jingwei Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Computer Vision; Information Theory; math.IT; Distributed, Parallel, and Cluster Computing; Machine Learning; Performance; Artificial Intelligence; eess.IV; eess.SP; math.CO; Operating Systems; physics.optics; Software Engineering

Catalog footprint

What is connected

13 works

13 topics

4 close collaborators


preprint, 2026, arXiv

FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline

Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with a stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with a VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72$\times$ throughput for small file read/write and up to 12.81$\times$ throughput for deep learning model training. FalconFS has been running in the production environment of Huawei's autonomous driving system with 10,000 NPUs for one year and has been open-sourced.

preprint, 2026, arXiv

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.

preprint, 2025, arXiv

SwitchFS: Asynchronous Metadata Updates for Distributed Filesystems with In-Network Coordination

Distributed filesystem metadata updates are typically synchronous. This creates inherent challenges for access efficiency, load balancing, and directory contention, especially under dynamic and skewed workloads. This paper argues that synchronous updates are overly conservative. We propose SwitchFS with asynchronous metadata updates that allow operations to return early and defer directory updates until reads, both hiding latency and amortizing overhead. The key challenge lies in efficiently maintaining the synchronous POSIX semantics of metadata updates. To address this, SwitchFS is co-designed with a programmable switch, leveraging the limited on-switch resources to track directory states with negligible overhead. This allows SwitchFS to aggregate and apply delayed updates efficiently, using batching and consolidation before directory reads. Evaluation shows that SwitchFS achieves up to 13.34$\times$ and 3.85$\times$ higher throughput, and 61.6% and 57.3% lower latency than two state-of-the-art distributed filesystems, Emulated-InfiniFS and Emulated-CFS, respectively, under skewed workloads. For real-world workloads, SwitchFS improves end-to-end throughput by 21.1$\times$, 1.1$\times$, and 0.3$\times$ over CephFS, Emulated-InfiniFS, and Emulated-CFS, respectively.
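The idea of returning early and deferring directory updates until a read can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `DeferredDirectoryStore` and its methods are hypothetical names, the "last operation per name wins" consolidation rule is an assumed simplification, and the programmable-switch coordination described in the abstract is not modeled at all.

```python
from collections import defaultdict


class DeferredDirectoryStore:
    """Toy model (hypothetical): directory updates are queued, then applied lazily."""

    def __init__(self):
        self.entries = defaultdict(set)    # directory -> applied child names
        self.pending = defaultdict(list)   # directory -> queued (op, name) updates

    def create(self, directory, name):
        # Return immediately; the directory's entry list is updated later.
        self.pending[directory].append(("add", name))

    def unlink(self, directory, name):
        # Deletions are queued the same way as creations.
        self.pending[directory].append(("del", name))

    @staticmethod
    def _consolidate(updates):
        # Assumed rule: keep only the last queued operation per name, so a
        # create followed by an unlink of the same name becomes one update.
        last = {}
        for op, name in updates:
            last[name] = op
        return last

    def listdir(self, directory):
        # A directory read is the synchronization point: consolidate the
        # queued updates and apply them as one batch before answering.
        batch = self._consolidate(self.pending.pop(directory, []))
        for name, op in batch.items():
            if op == "add":
                self.entries[directory].add(name)
            else:
                self.entries[directory].discard(name)
        return sorted(self.entries[directory])
```

In this sketch, `create` and `unlink` never touch the directory's entry set, so their cost is a single list append; the work is paid once per read, over a consolidated batch.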

preprint, 2022, arXiv

System response analysis in wavenumber domain for linear space-invariant time-varying problems

System response analysis is a powerful tool for linear time-invariant (LTI) systems, and it can also be applied to so-called linear space-invariant (LSI) but time-varying systems, which are the dual of conventional LTI problems. In this paper, we propose a system response analysis method for LSI problems by taking the Fourier transform of the field distribution over the space coordinate instead of the time coordinate. Specifically, input and output signals can be expressed in the wavenumber (spatial frequency) domain. In this way, the system function in the wavenumber domain can also be obtained for LSI systems. Given an arbitrary input and temporal profile of the medium, the output can be easily predicted using the system function. Moreover, for a complex temporal system, the proposed method allows for decomposing it into multiple simpler subsystems that appear in sequence in time. The system function of the whole system can be efficiently calculated by multiplying those of the individual subsystems.
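The relations described in this abstract can be written out as a short sketch. The symbols below ($E$ for the field, $H$ for the system function, $H_m$ for the $m$-th subsystem, $M$ for the number of subsystems) and the sign convention of the transform are assumptions for illustration, not notation taken from the paper; a scalar system function is also assumed.

```latex
% Spatial Fourier transform of the field (assumed sign convention)
\tilde{E}(k, t) = \int_{-\infty}^{\infty} E(x, t)\, e^{-\mathrm{i} k x}\, \mathrm{d}x

% System function in the wavenumber domain: output spectrum over input spectrum
H(k) = \frac{\tilde{E}_{\mathrm{out}}(k)}{\tilde{E}_{\mathrm{in}}(k)}

% Subsystems that follow one another in time multiply
H(k) = \prod_{m=1}^{M} H_m(k)
```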

preprint, 2021, arXiv

Reinventing 2D Convolutions for 3D Images

There has been considerable debate over 2D and 3D representation learning on 3D medical images. 2D approaches can benefit from large-scale 2D pretraining, but they are generally weak in capturing large 3D contexts. 3D approaches are natively strong in 3D contexts; however, few publicly available 3D medical datasets are large and diverse enough for universal 3D pretraining. Even for hybrid (2D + 3D) approaches, the intrinsic disadvantages within the 2D / 3D parts still exist. In this study, we bridge the gap between 2D and 3D convolutions by reinventing the 2D convolutions. We propose ACS (axial-coronal-sagittal) convolutions to perform natively 3D representation learning, while utilizing the pretrained weights on 2D datasets. In ACS convolutions, 2D convolution kernels are split by channel into three parts, and convolved separately on the three views (axial, coronal and sagittal) of 3D representations. Theoretically, any 2D CNN (ResNet, DenseNet, or DeepLab) can be converted into a 3D ACS CNN, with pretrained weights of the same parameter size. Extensive experiments on several medical benchmarks (including classification, segmentation and detection tasks) validate the consistent superiority of the pretrained ACS CNNs over the 2D / 3D CNN counterparts with / without pretraining. Even without pretraining, the ACS convolution can be used as a plug-and-play replacement for standard 3D convolution, with smaller model size and less computation.
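The channel split behind ACS convolutions can be made concrete with a small shape calculation. This is a sketch under assumptions rather than the authors' implementation: the function name `acs_kernel_shapes` is hypothetical, the split is assumed to be over output channels with any remainder given to the first views, and the mapping of each view to a particular singleton axis is an illustrative choice.

```python
def acs_kernel_shapes(out_channels, in_channels, k):
    """Split a 2D kernel of shape (out, in, k, k) into three view-specific 3D kernels.

    Returns a dict mapping view name to a 5D kernel shape
    (out_part, in_channels, depth, height, width).
    """
    base, rem = divmod(out_channels, 3)
    # Assumed policy: distribute the remainder over the first views.
    parts = [base + (1 if i < rem else 0) for i in range(3)]
    # Each view keeps the k x k kernel and adds one singleton spatial axis
    # (which axis belongs to which view is an assumption here).
    views = {
        "axial": (k, k, 1),
        "coronal": (k, 1, k),
        "sagittal": (1, k, k),
    }
    return {
        view: (n, in_channels) + shape
        for (view, shape), n in zip(views.items(), parts)
    }


def parameter_count(shape):
    # Number of weights in a kernel of the given shape.
    total = 1
    for dim in shape:
        total *= dim
    return total
```

Because every view-specific kernel has exactly one singleton spatial axis, the three parts together hold the same number of weights as the original 2D kernel, which is consistent with the abstract's statement that the pretrained weights keep the same parameter size.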

preprint, 2020, arXiv

Defective DP-colorings of sparse simple graphs

DP-coloring (also known as correspondence coloring) is a generalization of list coloring developed recently by Dvořák and Postle. We introduce and study $(i,j)$-defective DP-colorings of simple graphs. Let $g_{DP}(i,j,n)$ be the minimum number of edges in an $n$-vertex DP-$(i,j)$-critical graph. In this paper we determine a sharp bound on $g_{DP}(i,j,n)$ for each $i\geq 3$ and $j\geq 2i+1$ for infinitely many $n$.

preprint, 2020, arXiv

Hierarchical Style-based Networks for Motion Synthesis

Generating diverse and natural human motion is one of the long-standing goals for creating intelligent characters in the animated world. In this paper, we propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location. Our proposed method learns to model human motion by decomposing a long-range generation task in a hierarchical manner. Given the starting and ending states, a memory bank is used to retrieve motion references as source material for short-range clip generation. We first propose to explicitly disentangle the provided motion material into style and content counterparts via bi-linear transformation modelling, where diverse synthesis is achieved by free-form combination of these two components. The short-range clips are then connected to form a long-range motion sequence. Without ground truth annotation, we propose a parameterized bi-directional interpolation scheme to guarantee the physical validity and visual naturalness of generated results. On a large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse and plausible motion, and that it generalizes to unseen motion data during testing. Moreover, we demonstrate that the generated sequences are useful as subgoals for actual physical execution in the animated world.

preprint, 2020, arXiv

Operational Calibration: Debugging Confidence Errors for DNNs in the Field

Trained DNN models are increasingly adopted as integral parts of software systems, but they often perform deficiently in the field. A particularly damaging problem is that DNN models often give false predictions with high confidence, due to the unavoidable slight divergences between operation data and training data. To minimize the loss caused by inaccurate confidence, operational calibration, i.e., calibrating the confidence function of a DNN classifier against its operation domain, becomes a necessary debugging step in the engineering of the whole system. Operational calibration is difficult considering the limited budget of labeling operation data and the weak interpretability of DNN models. We propose a Bayesian approach to operational calibration that gradually corrects the confidence given by the model under calibration with a small number of labeled operation data deliberately selected from a larger set of unlabeled operation data. The approach is made effective and efficient by leveraging the locality of the learned representation of the DNN model and modeling the calibration as Gaussian Process Regression. Comprehensive experiments with various practical datasets and DNN models show that it significantly outperformed alternative methods, and in some difficult tasks it eliminated about 71% to 97% high-confidence (>0.9) errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work.

preprint, 2020, arXiv

Video Prediction via Example Guidance

In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states. The key insight is that the potential distribution of a sequence can be approximated with analogous ones in a repertoire drawn from the training pool, namely, expert examples. By further incorporating a novel optimization scheme into the training procedure, plausible predictions can be sampled efficiently from the distribution constructed from the retrieved examples. Meanwhile, our method can be seamlessly integrated with existing stochastic predictive models; significant enhancement is observed in comprehensive experiments, both quantitatively and qualitatively. We also demonstrate the ability to generalize to the motion of unseen classes, i.e., without access to corresponding data during the training phase.

preprint, 2019, arXiv

Target Localization with Jammer Removal Using Frequency Diverse Array

A foremost task in frequency diverse array multiple-input multiple-output (FDA-MIMO) radar is to efficiently obtain the target signal in the presence of interference. In this paper, we employ a novel "low-rank + low-rank + sparse" decomposition model to extract the low-rank desired signal and suppress the jamming signals from both barrage and burst jammers. In the literature, the barrage jamming signals, which are intentionally emitted by enemy jammer radar, are usually assumed to be Gaussian distributed. However, such an assumption is too simplistic to hold in practice, as the interference often exhibits non-Gaussian properties. Those non-Gaussian jamming signals, known as impulsive noise or burst jamming, arise unintentionally from friendly radar or other working radio equipment, with sources including amplifier saturation, sensor failures, thunderstorms and man-made noise. The estimation performance of existing estimators, which rely crucially on the Gaussian noise assumption, may degrade substantially since the probability density function (PDF) of burst jamming has heavier tails, beyond a few standard deviations, than the Gaussian distribution. To capture a more general signal model with burst jamming in practice, both barrage jamming and burst jamming are included, and a two-step "Go Decomposition" (GoDec) method via alternating minimization is devised for this mixed jamming signal model, where the a priori rank information is exploited to suppress the two kinds of jammers and estimate the desired target. Simulation results verify the robust performance of the devised scheme.

preprint, 2015, arXiv

Overlapped List Successive Cancellation Approach for Hardware Efficient Polar Code Decoder

This paper presents an efficient hardware design approach for list successive cancellation (LSC) decoding of polar codes. By applying a path-overlapping scheme, the l instances (l > 1) of the successive cancellation (SC) decoder required for LSC with list size l can be cut down to only one. This results in a dramatic reduction of the hardware complexity without any decoding performance loss. We also develop novel approaches to reduce the latency associated with the pipeline scheme. Simulation results show that with the proposed design approach the hardware efficiency is increased significantly over the recently proposed LSC decoders.
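For context, the single successive cancellation (SC) decoder that list decoding builds on can be sketched in software. The code below is the textbook recursive SC decoder with the min-sum LLR approximation, not the hardware design or the path-overlapping scheme from the paper. It assumes natural (non-bit-reversed) bit order, LLRs where a positive value favors bit 0, and frozen bits fixed to 0; the function names are illustrative.

```python
import math


def f(a, b):
    # Min-sum approximation of the upper-branch (check node) LLR update.
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g(a, b, u):
    # Lower-branch LLR update once the partial-sum bit u (0 or 1) is known.
    return b + (1 - 2 * u) * a


def polar_encode(u):
    # Computes x = u * F^(kron n) over GF(2) with F = [[1, 0], [1, 1]].
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    x1, x2 = polar_encode(u[:half]), polar_encode(u[half:])
    return [p ^ q for p, q in zip(x1, x2)] + x2


def sc_decode(llr, frozen):
    # Returns (decoded bits, re-encoded codeword bits) for this sub-block.
    if len(llr) == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = len(llr) // 2
    a, b = llr[:half], llr[half:]
    # Decode the first half using the f update on pairs of channel LLRs.
    u1, x1 = sc_decode([f(p, q) for p, q in zip(a, b)], frozen[:half])
    # Decode the second half using g and the re-encoded first-half decisions.
    u2, x2 = sc_decode([g(p, q, s) for p, q, s in zip(a, b, x1)], frozen[half:])
    return u1 + u2, [p ^ q for p, q in zip(x1, x2)] + x2
```

A list decoder with list size l tracks l candidate decision paths through this recursion, which is why a straightforward hardware design would instantiate the SC datapath l times.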

preprint, 2015, arXiv

TC: Throughput Centric Successive Cancellation Decoder Hardware Implementation for Polar Codes

This paper presents a hardware architecture of the fast simplified successive cancellation (fast-SSC) algorithm for polar codes, which significantly reduces the decoding latency and dramatically increases the throughput. Algorithmically, the fast-SSC algorithm suffers from the fact that its decoder scheduling and the consequent architecture depend on the code rate; this is a challenge for rate-compatible systems. However, by exploiting the homogeneity between the decoding processes of fast constituent polar codes and regular polar codes, the presented design is compatible with any rate. The scheduling plan and the purpose-designed processing core are also described. Results show that, compared with the state-of-the-art decoder, the proposed design can achieve at least 60% latency reduction for codes with length N = 1024. Using the Nangate FreePDK 45nm process, the proposed design can reach throughput up to 5.81 Gbps and 2.01 Gbps for the (1024, 870) and (1024, 512) polar codes, respectively.

preprint, 2015, arXiv

XJ-BP: Express Journey Belief Propagation Decoding for Polar Codes

This paper presents a novel belief propagation (BP) based decoding algorithm for polar codes. The proposed algorithm facilitates belief propagation by utilizing the specific constituent codes that exist in the factor graph, which results in an express journey (XJ) for belief information to propagate in each decoding iteration. In addition, this XJ-BP decoder employs a novel round-trip message passing scheduling method for increased efficiency. The proposed method simplifies the min-sum (MS) BP decoder by 40.6%. Along with the round-trip scheduling, the XJ-BP algorithm reduces the computational complexity of MS BP decoding by 90.4%; this enables an energy-efficient hardware implementation of BP decoding in practice.
