Source author record

Han Guo

Han Guo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, Computation and Language, Information Theory, math.IT, Artificial Intelligence, Computer Vision, math.NA, Mathematical Software, Multimedia, Neural and Evolutionary Computing, Numerical Analysis

Catalog footprint

What is connected

8 works

11 topics

4 close collaborators


Preprint, 2026, arXiv

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
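As a rough illustration of the GEMM-plus-epilogue idea described in this abstract, the NumPy sketch below computes one output tile at a time and applies a scale, bias, ReLU and residual epilogue to that tile before it is stored. It is not CODA's API: the names `gemm_with_epilogue` and `make_epilogue`, the tile sizes, and the particular epilogue are assumptions made for illustration, and NumPy on a CPU only mimics a tile that stays on chip.

```python
import numpy as np


def gemm_with_epilogue(a, b, epilogue, tile_m=32, tile_n=32):
    """Tiled matrix multiply that runs `epilogue` on each output tile.

    The inner `a[rows] @ b[:, cols]` stands in for a fixed GEMM mainloop;
    the epilogue sees the tile before it is written to `out`.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            rows = slice(i, min(i + tile_m, m))
            cols = slice(j, min(j + tile_n, n))
            acc = a[rows, :] @ b[:, cols]  # accumulator tile
            out[rows, cols] = epilogue(acc, rows, cols)
    return out


def make_epilogue(bias, residual, scale):
    """Hypothetical epilogue: scaling, bias add, ReLU, then residual accumulation."""
    def epilogue(acc, rows, cols):
        tile = acc * scale + bias[cols]        # scaling and per-column bias
        tile = np.maximum(tile, 0.0)           # elementwise activation
        return tile + residual[rows, cols]     # residual update
    return epilogue


rng = np.random.default_rng(0)
a = rng.standard_normal((70, 48))
b = rng.standard_normal((48, 90))
bias = rng.standard_normal(90)
residual = rng.standard_normal((70, 90))

fused = gemm_with_epilogue(a, b, make_epilogue(bias, residual, scale=0.5))
# Unfused reference: each step would be a separate pass over the full output.
unfused = np.maximum((a @ b) * 0.5 + bias, 0.0) + residual
```

In this toy version the fused and unfused results agree numerically; the difference on a GPU would be in how many times the intermediate tensor travels through global memory, which this sketch does not measure.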

Preprint, 2024, arXiv

Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

Multi-objective reinforcement learning (MORL) aims to find a set of high-performing and diverse policies that address trade-offs between multiple conflicting objectives. However, in practice, decision makers (DMs) often deploy only one or a limited number of trade-off policies. Providing too many diversified trade-off policies to the DM not only significantly increases their workload but also introduces noise in multi-criterion decision-making. With this in mind, we propose a human-in-the-loop policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our method proactively learns the DM's implicit preference information without requiring any a priori knowledge, which is often unavailable in real-world black-box decision scenarios. The learned preference information is used to progressively guide policy optimization towards policies of interest. We evaluate our approach against three conventional MORL algorithms that do not consider preference information and four state-of-the-art preference-based MORL algorithms on two MORL environments for robot control and smart grid management. Experimental results fully demonstrate the effectiveness of our proposed method in comparison to the other peer algorithms.

Preprint, 2022, arXiv

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

The ability to quickly learn from a small quantity of training data widens the range of machine learning applications. In this paper, we propose a data-efficient image captioning model, VisualGPT, which leverages the linguistic knowledge from a large pretrained language model (LM). A crucial challenge is to balance between the use of visual information in the image and prior linguistic knowledge acquired from pretraining. We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data. The proposed self-resurrecting activation unit produces sparse activations but has reduced susceptibility to zero gradients. We train the proposed model, VisualGPT, on 0.1%, 0.5% and 1% of MS COCO and Conceptual Captions training data. Under these conditions, we outperform the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions. Further, VisualGPT achieves the state-of-the-art result on IU X-ray, a medical report generation dataset. To the best of our knowledge, this is the first work that improves data efficiency of image captioning by utilizing LM pretrained on unimodal data. Our code is available at: https://github.com/Vision-CAIR/VisualGPT.

Preprint, 2020, arXiv

Butterfly factorization via randomized matrix-vector multiplications

This paper presents an adaptive randomized algorithm for computing the butterfly factorization of an $m\times n$ matrix with $m\approx n$ provided that both the matrix and its transpose can be rapidly applied to arbitrary vectors. The resulting factorization is composed of $O(\log n)$ sparse factors, each containing $O(n)$ nonzero entries. The factorization can be attained using $O(n^{3/2}\log n)$ computation and $O(n\log n)$ memory resources. The proposed algorithm applies to matrices with strong and weak admissibility conditions arising from surface integral equation solvers with a rigorous error bound, and is implemented in parallel.

Preprint, 2020, arXiv

Multi-Source Domain Adaptation for Text Classification via DistanceNet-Bandits

Domain adaptation performance of a learning algorithm on a target domain is a function of its source domain error and a divergence measure between the data distribution of these two domains. We present a study of various distance-based measures in the context of NLP tasks, that characterize the dissimilarity between domains based on sample estimates. We first conduct analysis experiments to show which of these distance measures can best differentiate samples from same versus different domains, and are correlated with empirical results. Next, we develop a DistanceNet model which uses these distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation. Finally, we extend this model to a novel DistanceNet-Bandit model, which employs a multi-armed bandit controller to dynamically switch between multiple source domains and allow the model to learn an optimal trajectory and mixture of domains for transfer to the low-resource target domain. We conduct experiments on popular sentiment analysis datasets with several diverse domains and show that our DistanceNet model, as well as its dynamic bandit variant, can outperform competitive baselines in the context of unsupervised domain adaptation.
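The idea of adding a domain distance to the task loss can be sketched as follows. The abstract does not say which distance measures are used, so this example assumes one simple choice, the squared L2 distance between mean feature vectors; the function names, the `weight` parameter and the toy features are all hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_embedding_distance(source_feats, target_feats):
    """Squared L2 distance between the mean feature vectors of two domains.

    Assumed stand-in for a sample-based domain distance; each input has
    shape (n_samples, n_features).
    """
    diff = source_feats.mean(axis=0) - target_feats.mean(axis=0)
    return float(diff @ diff)


def joint_loss(task_loss, source_feats, target_feats, weight=0.1):
    """Task loss plus a weighted domain-distance penalty (hypothetical form)."""
    return task_loss + weight * mean_embedding_distance(source_feats, target_feats)


# Toy features: the target domain is the source shifted by 2 in every coordinate.
source = np.ones((4, 3))
target = source + 2.0

distance = mean_embedding_distance(source, target)
total = joint_loss(1.0, source, target, weight=0.5)
```

Minimizing such a joint objective pushes the feature extractor toward representations whose source and target statistics are close, which is the general motivation the abstract gives for the extra loss term.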

Preprint, 2016, arXiv

Correlated-PCA: Principal Components' Analysis when Data and Noise are Correlated

Given a matrix of observed data, Principal Components Analysis (PCA) computes a small number of orthogonal directions that contain most of its variability. Provably accurate solutions for PCA have been in use for a long time. However, to the best of our knowledge, all existing theoretical guarantees for it assume that the data and the corrupting noise are mutually independent, or at least uncorrelated. This is often valid in practice, but not always. In this paper, we study the PCA problem in the setting where the data and noise can be correlated. Such noise is often also referred to as "data-dependent noise". We obtain a correctness result for the standard eigenvalue decomposition (EVD) based solution to PCA under simple assumptions on the data-noise correlation. We also develop and analyze a generalization of EVD, cluster-EVD, that improves upon EVD in certain regimes.
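A minimal sketch of the standard EVD-based PCA solution mentioned in this abstract, applied to data with data-dependent noise, might look like the following. The noise model (a small fixed linear map of the signal), the dimensions, and the helper names `pca_evd` and `subspace_error` are assumptions chosen for illustration; the paper's own assumptions and the cluster-EVD variant are not reproduced here.

```python
import numpy as np


def pca_evd(data, rank):
    """Top-`rank` eigenvectors of the sample covariance of `data`.

    `data` has shape (n_features, n_samples) and is assumed zero-mean.
    """
    n_samples = data.shape[1]
    cov = data @ data.T / n_samples
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:rank]      # indices of the largest ones
    return eigvecs[:, top]


def subspace_error(basis_true, basis_est):
    """Spectral norm of the part of `basis_true` lying outside span(`basis_est`)."""
    n = basis_true.shape[0]
    residual_proj = np.eye(n) - basis_est @ basis_est.T
    return np.linalg.norm(residual_proj @ basis_true, 2)


rng = np.random.default_rng(0)
n, r, n_samples = 50, 3, 2000

# True low-dimensional subspace and zero-mean coefficients with unequal variances.
basis, _ = np.linalg.qr(rng.standard_normal((n, r)))
coeffs = rng.standard_normal((r, n_samples)) * np.array([[3.0], [2.0], [1.0]])
signal = basis @ coeffs

# Hypothetical data-dependent noise: correlated with the signal by construction.
noise_map = 0.002 * rng.standard_normal((n, n))
observed = signal + noise_map @ signal

estimate = pca_evd(observed, r)
err = subspace_error(basis, estimate)
```

Here `err` measures how far the recovered subspace is from the true one; with noise this small relative to the signal, the EVD estimate stays close to the true subspace even though the noise is correlated with the data.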

Preprint, 2016, arXiv

Correlated-PCA: Principal Components' Analysis when Data and Noise are Correlated

Preprint, 2014, arXiv

An Online Algorithm for Separating Sparse and Low-dimensional Signal Sequences from their Sum

This paper designs and evaluates a practical algorithm, called practical recursive projected compressive sensing (Prac-ReProCS), for recovering a time sequence of sparse vectors $S_t$ and a time sequence of dense vectors $L_t$ from their sum, $M_t:= S_t + L_t$, when any subsequence of the $L_t$'s lies in a slowly changing low-dimensional subspace. A key application where this problem occurs is in video layering where the goal is to separate a video sequence into a slowly changing background sequence and a sparse foreground sequence that consists of one or more moving regions/objects. Prac-ReProCS is a practical modification of its theoretical counterpart which was analyzed in our recent work. Experimental comparisons demonstrating the advantage of the approach for both simulated and real videos are shown. Extension to the undersampled case is also developed.
