Source author record

Tianqi Chen

Tianqi Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint: 20 works, 20 topics, 4 close collaborators

preprint, 2026, arXiv

AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning

Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.

preprint, 2026, arXiv

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.

preprint, 2022, arXiv

Bayesian network mediation analysis with application to brain functional connectome

The brain functional connectome, the collection of interconnected neural circuits along functional networks, is one of the most cutting-edge neuroimaging traits and has the potential to play a mediating role within the effect pathway between an exposure and an outcome. While existing mediation analytic approaches are capable of providing insight into complex processes, they mainly focus on a univariate mediator or mediator vector, without considering network-variate mediators. To fill this methodological gap and accomplish this exciting and urgent application, in this paper we propose an integrative mediation analysis under a Bayesian paradigm with networks entailing the mediation effect. To parameterize the network measurements, we introduce individually specified stochastic block models with unknown block allocation, and naturally bridge effect elements through the latent network mediators induced by the connectivity weights across network modules. To enable the identification of truly active mediating components, we simultaneously impose a feature selection across network mediators. We show the superiority of our model in estimating different effect components and selecting active mediating network structures. As a practical illustration of this approach's application to network neuroscience, we characterize the relationship between a therapeutic intervention and opioid abstinence as mediated by brain functional sub-networks.

preprint, 2022, arXiv

SONAR: Joint Architecture and System Optimization Search

There is a growing need to deploy machine learning for different tasks on a wide array of new hardware platforms. Such deployment scenarios require tackling multiple challenges, including identifying a model architecture that can achieve a suitable predictive accuracy (architecture search), and finding an efficient implementation of the model to satisfy underlying hardware-specific systems constraints such as latency (system optimization search). Existing works treat architecture search and system optimization search as separate problems and solve them sequentially. In this paper, we instead propose to solve these problems jointly, and introduce a simple but effective baseline method called SONAR that interleaves these two search problems. SONAR aims to efficiently optimize for predictive accuracy and inference latency by applying early stopping to both search processes. Our experiments on multiple different hardware back-ends show that SONAR identifies nearly optimal architectures 30 times faster than a brute force approach.

preprint, 2022, arXiv

Stack operation of tensor networks

The tensor network, as a factorization of tensors, aims at performing the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only the tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack approach, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with matrix product states based machine learning as an example. Our results are compared with the for loop and the efficient coding method on both CPU and GPU.

preprint, 2022, arXiv

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, lead to a lot of wasted computation and, therefore, a loss in performance. This paper presents CoRa, a tensor compiler that allows users to easily generate efficient code for ragged tensor operators targeting a wide range of CPUs and GPUs. Evaluating CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model, we find that CoRa (i) performs competitively with hand-optimized implementations of the operators and the transformer encoder and (ii) achieves, over PyTorch, a 1.6X geomean speedup for the encoder on an Nvidia GPU and a 1.86X geomean speedup for the multi-head attention module used in transformers on an ARM CPU.
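To make the cost of padding concrete, here is a minimal Python sketch. It is not part of CoRa; the function name `padding_waste` and the example lengths are illustrative assumptions. It computes the fraction of elements in a batch that are padding when every sequence is padded to the longest one.

```python
def padding_waste(lengths):
    """Fraction of a padded batch that is padding rather than real data.

    Assumes every sequence is padded to the longest length in the batch.
    """
    padded_elements = max(lengths) * len(lengths)
    useful_elements = sum(lengths)
    return 1.0 - useful_elements / padded_elements


# Hypothetical batch of four sequences with lengths 2, 4, 8 and 2.
print(padding_waste([2, 4, 8, 2]))  # 0.5: half of the padded batch is padding
```

For operators whose cost grows faster than linearly in sequence length, such as attention, the wasted share of computation would be larger than this element count suggests.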

preprint, 2021, arXiv

Cortex: A Compiler for Recursive Deep Learning Models

Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low-level kernel optimizations such as those found in vendor libraries. This approach often leaves significant performance on the table, especially for the case of recursive deep learning models. In this paper, we present Cortex, a compiler-based approach to generate highly efficient code for recursive models for low latency inference. Our compiler approach and low reliance on vendor libraries enable us to perform end-to-end optimizations, leading to up to 14X lower inference latencies over past work, across different backends.

preprint, 2021, arXiv

Thermodynamic performance of a periodically driven harmonic oscillator correlated with the baths

We consider a harmonic oscillator under periodic driving and coupled to two harmonic-oscillator heat baths at different temperatures. We use the thermofield transformation with chain mapping for this setup, which allows us to study the unitary evolution of the system and the baths up to a time when the periodic steady state emerges in the system. We characterize this periodic steady state, and we show that, by tuning the system and the bath parameters, one can turn this system from an engine to an accelerator or even to a heater. The possibility to study the unitary evolution of the system and baths also allows us to evaluate the steady correlations that build between the system and the baths, and correlations that grow between the baths.

preprint, 2020, arXiv

Effects of staggered Dzyaloshinskii-Moriya interactions in a quasi-two-dimensional Shastry-Sutherland model

Frustrated quantum spin systems exhibit exotic physics induced by an external magnetic field with anisotropic interactions. Here, we study the effect of non-uniform Dzyaloshinskii-Moriya (DM) interactions on a quasi-two-dimensional Shastry-Sutherland lattice using a matrix product states (MPS) algorithm. We first recover the magnetization plateau structure present in this geometry and then we show that both interdimer and intradimer DM interactions significantly modify the plateaux. The non-number-conserving intradimer interaction smoothens the shape of the magnetization curve, while the number-conserving interdimer interaction induces different small plateaux, which are signatures of the finite size of the system. Interestingly, the interdimer DM interaction induces chirality in the system. We thus characterize these chiral phases with particular emphasis on their robustness against intradimer DM interactions.

preprint, 2020, arXiv

Steady state quantum transport through an anharmonic oscillator strongly coupled to two heat reservoirs

We investigate the transport properties of an anharmonic oscillator, modeled by a single-site Bose-Hubbard model, coupled to two different thermal baths using the numerically exact thermofield based chain-mapping matrix product states (TCMPS) approach. We compare the effectiveness of TCMPS to probe the nonequilibrium dynamics of a strongly interacting system irrespective of the system-bath coupling against the global master equation approach in Gorini-Kossakowski-Sudarshan-Lindblad form. We discuss the effect of on-site interactions, temperature bias as well as the system-bath couplings on the steady state transport properties. Lastly, we also show evidence of non-Markovian dynamics by studying the non-monotonicity of the time evolution of the trace distance between two different initial states.

preprint, 2019, arXiv

Skyrmion quantum spin Hall effect

The quantum spin Hall effect is conventionally thought to require a strong spin-orbit coupling, producing an effective spin-dependent magnetic field. However, spin currents can also be present without transport of spins, for example, in spin-waves or skyrmions. In this paper, we show that topological skyrmionic spin textures can be used to realize a quantum spin Hall effect. From basic arguments relating to the single-valuedness of the wave function, we deduce that loop integrals of the derivative of the Hamiltonian must have a spectrum that is integer multiples of $2\pi$. By relating this to the spin current, we form a new quantity called the quantized spin current which obeys a precise quantization rule. This allows us to derive a quantum spin Hall effect, which we illustrate with an example of a spin-1 Bose-Einstein condensate.

preprint, 2016, arXiv

Net2Net: Accelerating Learning via Knowledge Transfer

We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful process in which each new model is trained from scratch. Our Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. Our techniques are based on the concept of function-preserving transformations between neural network specifications. This differs from previous approaches to pre-training that altered the function represented by a neural net when adding layers to it. Using our knowledge transfer mechanism to add depth to Inception modules, we demonstrate a new state of the art accuracy rating on the ImageNet dataset.
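The idea of a function-preserving transformation can be sketched for the "wider" case as follows. This is a minimal illustration, not the authors' code: it assumes a two-layer ReLU network stored as plain Python lists, and the helper names (`forward`, `widen`) are hypothetical. Each new hidden unit copies an existing one, and the outgoing weights of every copied unit are divided by its number of copies so the output is unchanged.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(rows, x):
    # rows is a list of weight rows; returns the matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def forward(w1, w2, x):
    # w1: hidden x input weights, w2: output x hidden weights.
    return matvec(w2, relu(matvec(w1, x)))


def widen(w1, w2, new_width, rng):
    """Grow the hidden layer to new_width units without changing the function."""
    old_width = len(w1)
    # Original units map to themselves; each extra unit copies a random one.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    counts = [mapping.count(j) for j in range(old_width)]
    new_w1 = [list(w1[j]) for j in mapping]
    # Split each outgoing weight evenly among the copies of its source unit.
    new_w2 = [[row[j] / counts[j] for j in mapping] for row in w2]
    return new_w1, new_w2


rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
w2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
x = [0.5, -1.0, 2.0]
wide_w1, wide_w2 = widen(w1, w2, 7, rng)
# The two outputs agree up to floating-point rounding.
print(forward(w1, w2, x), forward(wide_w1, wide_w2, x))
```

In practice the copied units would need some symmetry breaking (for example a small amount of noise) before further training, which this sketch omits.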

preprint, 2016, arXiv

Training Deep Nets with Sublinear Memory Cost

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.
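The O(sqrt(n)) trade-off can be illustrated with a toy chain of scalar layers. The sketch below is an assumption-laden illustration rather than the paper's implementation: the layers are `tanh(w * x)`, the function names are hypothetical, and "memory" is simply a count of stored activations. The forward pass keeps only the input of every k-th layer with k = ceil(sqrt(n)); the backward pass recomputes one segment at a time.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    # Derivative of tanh(w * x) with respect to x.
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(ws, x0):
    """Baseline backprop: stores the input of every layer (n values)."""
    inputs = [x0]
    for w in ws[:-1]:
        inputs.append(layer(w, inputs[-1]))
    g = 1.0
    for w, x in zip(reversed(ws), reversed(inputs)):
        g *= layer_grad(w, x)
    return g, len(inputs)


def grad_checkpointed(ws, x0):
    """Checkpointed backprop: stores roughly 2 * sqrt(n) values at peak."""
    n = len(ws)
    k = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1
    checkpoints = {}
    x = x0
    for i, w in enumerate(ws):
        if i % k == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = layer(w, x)
    peak = len(checkpoints)
    g = 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        # Recompute the layer inputs inside this segment only.
        segment = [checkpoints[start]]
        for w in ws[start:end - 1]:
            segment.append(layer(w, segment[-1]))
        peak = max(peak, len(checkpoints) + len(segment))
        for w, xin in zip(reversed(ws[start:end]), reversed(segment)):
            g *= layer_grad(w, xin)
    return g, peak


ws = [0.9 + 0.001 * i for i in range(100)]
print(grad_full(ws, 0.3)[1], grad_checkpointed(ws, 0.3)[1])  # 100 versus 20 stored values
```

Both functions return the same gradient; the checkpointed version pays for its smaller footprint by recomputing each segment once, which matches the abstract's "extra forward pass" cost.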

preprint, 2015, arXiv

A Complete Recipe for Stochastic Gradient MCMC

Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete. That is, any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous continuous-dynamic samplers can be trivially "reinvented" in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods.

preprint, 2015, arXiv

Empirical Evaluation of Rectified Activations in Convolutional Network

In this paper we investigate the performance of different types of rectified activation functions in convolutional neural networks: the standard rectified linear unit (ReLU), the leaky rectified linear unit (Leaky ReLU), the parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear unit (RReLU). We evaluate these activation functions on standard image classification tasks. Our experiments suggest that incorporating a non-zero slope for the negative part in rectified activation units could consistently improve the results. Thus our findings are negative on the common belief that sparsity is the key to good performance in ReLU. Moreover, on small-scale datasets, using a deterministic negative slope or learning it are both prone to overfitting. They are not as effective as their randomized counterpart. By using RReLU, we achieved 75.68% accuracy on the CIFAR-100 test set without multiple tests or ensembling.
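A randomized leaky rectifier can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the negative-side slope is drawn uniformly from `[lower, upper]` during training and fixed to the midpoint at test time, and the default bounds 1/8 and 1/3 are assumptions chosen here for illustration rather than values taken from the abstract.

```python
import random


def rrelu(xs, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized leaky ReLU applied elementwise to a list of floats."""
    out = []
    for x in xs:
        if x >= 0:
            out.append(x)
        else:
            # Random slope while training, deterministic average slope otherwise.
            slope = rng.uniform(lower, upper) if training else (lower + upper) / 2
            out.append(slope * x)
    return out


print(rrelu([-1.0, 0.0, 2.0], training=False))
print(rrelu([-1.0, 0.0, 2.0], training=True, rng=random.Random(0)))
```

Setting `lower == upper` recovers a fixed-slope Leaky ReLU, which makes the relationship between the variants compared in the abstract easy to see.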

preprint, 2015, arXiv

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our preliminary experiments reveal promising results on large scale deep neural network applications using multiple GPU machines.

preprint, 2014, arXiv

A Parallel and Efficient Algorithm for Learning to Match

Many tasks in data mining and related fields can be formalized as matching between objects in two heterogeneous domains, including collaborative filtering, link prediction, image tagging, and web search. Machine learning techniques, referred to as learning-to-match in this paper, have been successfully applied to the problems. Among them, a class of state-of-the-art methods, named feature-based matrix factorization, formalize the task as an extension to matrix factorization by incorporating auxiliary features into the model. Unfortunately, making those algorithms scale to real world problems is challenging, and simple parallelization strategies fail due to the complex cross talking patterns between sub-tasks. In this paper, we tackle this challenge with a novel parallel and efficient algorithm for feature-based matrix factorization. Our algorithm, based on coordinate descent, can easily handle hundreds of millions of instances and features on a single machine. The key recipe of this algorithm is an iterative relaxation of the objective to facilitate parallel updates of parameters, with guaranteed convergence on minimizing the original objective function. Experimental results demonstrate that the proposed method is effective on a wide range of matching problems, with efficiency significantly improved upon the baselines while accuracy remains unchanged.

preprint, 2014, arXiv

Stochastic Gradient Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system; such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.
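The friction-based update described above can be sketched on a one-dimensional toy problem. The code below is a minimal illustration under explicit assumptions, not the authors' implementation: the target is a standard normal with potential U(theta) = theta^2 / 2, the mass matrix is the identity, the minibatch gradient is imitated by adding Gaussian noise to the exact gradient, and the estimate of the gradient-noise covariance is set to zero. All parameter names and default values are hypothetical.

```python
import math
import random


def sghmc_standard_normal(steps=50000, eps=0.1, friction=1.0,
                          grad_noise=1.0, burn_in=1000, seed=0):
    """Draw correlated samples from N(0, 1) using a noisy gradient."""
    rng = random.Random(seed)
    theta, v = 0.0, 0.0
    samples = []
    for step in range(steps + burn_in):
        # Noisy gradient of U, standing in for a minibatch estimate.
        noisy_grad = theta + rng.gauss(0.0, grad_noise)
        # Momentum update: gradient step, friction term, injected noise
        # with variance 2 * friction * eps.
        v += (-eps * noisy_grad
              - eps * friction * v
              + rng.gauss(0.0, math.sqrt(2.0 * friction * eps)))
        theta += eps * v
        if step >= burn_in:
            samples.append(theta)
    return samples


samples = sghmc_standard_normal()
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # expected to land near 0 and 1, with some step-size bias
```

Because the gradient-noise estimate is ignored here, the sampled variance is slightly inflated; shrinking `eps` reduces that bias at the cost of slower mixing.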

preprint, 2012, arXiv

Relation of a New Interpretation of Stochastic Differential Equations to Ito Process

Stochastic differential equations (SDE) are widely used in modeling stochastic dynamics in the literature. However, an SDE alone is not enough to determine a unique process. A specified interpretation for stochastic integration is needed. Different interpretations specify different dynamics. Recently, a new interpretation of SDE was put forward by one of us. This interpretation has a built-in Boltzmann-Gibbs distribution and shows the existence of a potential function for general processes, which reveals both local and global dynamics. Despite its powerful property, its relation with classical ones in arbitrary dimension remains obscure. In this paper, we clarify this connection and derive the concise relation between the new interpretation and the Ito process. We point out that the derived relation is experimentally testable.

preprint, 2011, arXiv

Feature-Based Matrix Factorization

Recommender systems have become increasingly popular and widely used in many applications recently. The increasing information available, not only in quantity but also in type, poses a big challenge for recommender systems: how to leverage this rich information to get better performance. Most traditional approaches try to design a specific model for each scenario, which demands great effort in developing and modifying models. In this technical report, we describe our implementation of feature-based matrix factorization. This model is an abstraction of many variants of matrix factorization models, and new types of information can be utilized by simply defining new features, without modifying any lines of code. Using the toolkit, we built the best single model reported on track 1 of KDDCup'11.
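The abstract does not spell out the model, so the following Python sketch rests on an assumption about its general shape: a prediction is a base score plus linear bias terms over global, user and item features, plus the inner product of a feature-weighted user factor and a feature-weighted item factor. The function name `predict`, the sparse-dictionary layout and all parameter names are hypothetical and are not taken from the toolkit.

```python
def predict(mu, global_feats, user_feats, item_feats, b_g, b_u, b_i, P, Q):
    """Score one (user, item) pair from sparse feature dictionaries.

    Each *_feats argument maps a feature index to its value. b_g, b_u, b_i
    map feature indices to bias weights; P and Q map user and item feature
    indices to latent factor vectors of equal length.
    """
    bias = (sum(b_g[j] * v for j, v in global_feats.items())
            + sum(b_u[j] * v for j, v in user_feats.items())
            + sum(b_i[j] * v for j, v in item_feats.items()))
    dim = len(next(iter(P.values())))
    # Feature-weighted sums of the latent factors on each side.
    p = [sum(P[j][d] * v for j, v in user_feats.items()) for d in range(dim)]
    q = [sum(Q[j][d] * v for j, v in item_feats.items()) for d in range(dim)]
    return mu + bias + sum(pd * qd for pd, qd in zip(p, q))


# With one-hot user and item indicators this reduces to biased matrix factorization.
score = predict(
    mu=3.0,
    global_feats={},
    user_feats={0: 1.0},
    item_feats={1: 1.0},
    b_g={},
    b_u={0: 0.5},
    b_i={1: -0.25},
    P={0: [1.0, 2.0]},
    Q={1: [0.5, 0.25]},
)
print(score)  # 4.25
```

Under this reading, adding a new kind of side information (for example a time bin or an item category) only means adding entries to the feature dictionaries, which is consistent with the abstract's point about defining new features without changing code.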
