Researcher profile

Hung Le

Hung Le contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
20works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

20 published item(s)

preprint2026arXiv

Continual Fine-Tuning of Large Language Models via Program Memory

Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute. However, in continual settings where models are updated sequentially with small datasets, conventional LoRA updates struggle to balance rapid adaptation and knowledge retention. Existing methods typically treat the low-rank space as a homogeneous update region, lacking mechanisms to regulate how short-term updates are consolidated over time. We propose a continual LoRA framework with \textbf{Pro}gram memory, inspired by \textbf{C}omplementary \textbf{L}earning Systems in neuroscience. Our approach, dubbed \textbf{ProCL}, organizes LoRA adapters into structured program memory slots that are dynamically retrieved through input-conditioned attention. This enables rapid and localized adaptation, encouraging similar inputs to reuse shared adapter regions while reserving unused capacity for future data. The slots are then combined with the underlying adapter, which maintains a distributed representation that gradually accumulates knowledge across tasks to balance plasticity and stability. Our method operates entirely within the LoRA parameterization and incurs no additional inference cost. Experiments on diverse benchmarks demonstrate improved retention and reduced catastrophic forgetting over other continual LoRA strategies.

preprint2024arXiv

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builts upon a core module, called multimodal video block (MVB), which consists of conventional spatialtemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing of extra training overhead as opposed to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement on visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.

preprint2023arXiv

Memory-Augmented Theory of Mind Network

Social reasoning necessitates the capacity of theory of mind (ToM), the ability to contextualise and attribute mental states to others without having access to their internal cognitive structure. Recent machine learning approaches to ToM have demonstrated that we can train the observer to read the past and present behaviours of other agents and infer their beliefs (including false beliefs about things that no longer exist), goals, intentions and future actions. The challenges arise when the behavioural space is complex, demanding skilful space navigation for rapidly changing contexts for an extended period. We tackle the challenges by equipping the observer with novel neural memory mechanisms to encode, and hierarchical attention to selectively retrieve information about others. The memories allow rapid, selective querying of distal related past behaviours of others to deliberatively reason about their current mental state, beliefs and future behaviours. This results in ToMMY, a theory of mind model that learns to reason while making little assumptions about the underlying mental processes. We also construct a new suite of experiments to demonstrate that memories facilitate the learning process and achieve better theory of mind performance, especially for high-demand false-belief tasks that require inferring through multiple steps of changes.

preprint2022arXiv

Approximate Distance Oracles for Planar Graphs with Subpolynomial Error Dependency

Thorup [FOCS'01, JACM'04] and Klein [SODA'01] independently showed that there exists a $(1+ε)$-approximate distance oracle for planar graphs with $O(n (\log n)ε^{-1})$ space and $O(ε^{-1})$ query time. While the dependency on $n$ is nearly linear, the space-query product of their oracles depend quadratically on $1/ε$. Many follow-up results either improved the space \emph{or} the query time of the oracles while having the same, sometimes worst, dependency on $1/ε$. Kawarabayashi, Sommer, and Thorup [SODA'13] were the first to improve the dependency on $1/ε$ from quadratic to nearly linear (at the cost of $\log^*(n)$ factors). It is plausible to conjecture that the linear dependency on $1/ε$ is optimal: for many known distance-related problems in planar graphs, it was proved that the dependency on $1/ε$ is at least linear. In this work, we disprove this conjecture by reducing the dependency of the space-query product on $1/ε$ from linear all the way down to \emph{subpolynomial} $(1/ε)^{o(1)}$. More precisely, we construct an oracle with $O(n\log(n)(ε^{-o(1)} + \log^*n))$ space and $\log^{2+o(1)}(1/ε)$ query time. Our construction is the culmination of several different ideas developed over the past two decades.

preprint2022arXiv

Can't See The Forest for the Trees: Navigating Metric Spaces by Bounded Hop-Diameter Spanners

Spanners for metric spaces have been extensively studied, both in general metrics and in restricted classes, perhaps most notably in low-dimensional Euclidean spaces -- due to their numerous applications. Euclidean spanners can be viewed as means of compressing the $\binom{n}{2}$ pairwise distances of a $d$-dimensional Euclidean space into $O(n) = O_{ε,d}(n)$ spanner edges, so that the spanner distances preserve the original distances to within a factor of $1+ε$, for any $ε> 0$. Moreover, one can compute such spanners in optimal $O(n \log n)$ time. Once the spanner has been computed, it serves as a "proxy" overlay network, on which the computation can proceed, which gives rise to huge savings in space and other important quality measures. On the negative side, by working on the spanner rather than the original metric, one loses the key property of being able to efficiently "navigate" between pairs of points. While in the original metric, one can go from any point to any other via a direct edge, it is unclear how to efficiently navigate in the spanner: How can we translate the existence of a "good" path into an efficient algorithm finding it? Moreover, usually by "good" path we mean a path whose weight approximates the original distance between its endpoints -- but a priori the number of edges (or "hops") in the path could be huge. To control the hop-length of paths, one can try to upper bound the spanner's hop-diameter, but naturally bounded hop-diameter spanners are more complex than spanners with unbounded hop-diameter, which might render the algorithmic task of efficiently finding good paths more challenging. The original metric enables us to navigate optimally -- a single hop (for any two points) with the exact distance, but the price is high -- $Θ(n^2)$ edges. [...]

preprint2022arXiv

Learning Theory of Mind via Dynamic Traits Attribution

Machine learning of Theory of Mind (ToM) is essential to build social agents that co-live with humans and other agents. This capacity, once acquired, will help machines infer the mental states of others from observed contextual action trajectories, enabling future prediction of goals, intention, actions and successor representations. The underlying mechanism for such a prediction remains unclear, however. Inspired by the observation that humans often infer the character traits of others, then use it to explain behaviour, we propose a new neural ToM architecture that learns to generate a latent trait vector of an actor from the past trajectories. This trait vector then multiplicatively modulates the prediction mechanism via a `fast weights' scheme in the prediction neural network, which reads the current context and predicts the behaviour. We empirically show that the fast weights provide a good inductive bias to model the character traits of agents and hence improves mindreading ability. On the indirect assessment of false-belief understanding, the new ToM model enables more efficient helping behaviours.

preprint2022arXiv

Locality-Sensitive Orderings and Applications to Reliable Spanners

Chan, Har-Peled, and Jones [2020] recently developed locality-sensitive ordering (LSO), a new tool that allows one to reduce problems in the Euclidean space $\mathbb{R}^d$ to the $1$-dimensional line. They used LSO's to solve a host of problems. Later, Buchin, Har-Peled, and Ol{á}h [2019,2020] used the LSO of Chan {\em et al. } to construct very sparse \emph{reliable spanners} for the Euclidean space. A highly desirable feature of a reliable spanner is its ability to withstand a massive failure: the network remains functioning even if 90\% of the nodes fail. In a follow-up work, Har-Peled, Mendel, and Ol{á}h [2021] constructed reliable spanners for general and topologically structured metrics. Their construction used a different approach, and is based on sparse covers. In this paper, we develop the theory of LSO's in non-Euclidean metrics by introducing new types of LSO's suitable for general and topologically structured metrics. We then construct such LSO's, as well as constructing considerably improved LSO's for doubling metrics. Afterwards, we use our new LSO's to construct reliable spanners with improved stretch and sparsity parameters. Most prominently, we construct $\tilde{O}(n)$-size reliable spanners for trees and planar graphs with the optimal stretch of $2$. Along the way to the construction of LSO's and reliable spanners, we introduce and construct ultrametric covers, and construct $2$-hop reliable spanners for the line.

preprint2022arXiv

Low Treewidth Embeddings of Planar and Minor-Free Metrics

Cohen-Addad, Filtser, Klein and Le [FOCS'20] constructed a stochastic embedding of minor-free graphs of diameter $D$ into graphs of treewidth $O_ε(\log n)$ with expected additive distortion $+εD$. Cohen-Addad et al. then used the embedding to design the first quasi-polynomial time approximation scheme (QPTAS) for the capacitated vehicle routing problem. Filtser and Le [STOC'21] used the embedding (in a different way) to design a QPTAS for the metric Baker's problems in minor-free graphs. In this work, we devise a new embedding technique to improve the treewidth bound of Cohen-Addad et al. exponentially to $O_ε(\log\log n)^2$. As a corollary, we obtain the first efficient PTAS for the capacitated vehicle routing problem in minor-free graphs. We also significantly improve the running time of the QPTAS for the metric Baker's problems in minor-free graphs from $n^{O_ε(\log(n))}$ to $n^{O_ε(\log\log(n))^3}$. Applying our embedding technique to planar graphs, we obtain a deterministic embedding of planar graphs of diameter $D$ into graphs of treewidth $O((\log\log n)^2)/ε)$ and additive distortion $+εD$ that can be constructed in nearly linear time. Important corollaries of our result include a bicriteria PTAS for metric Baker's problems and a PTAS for the vehicle routing problem with bounded capacity in planar graphs, both run in almost-linear time. The running time of our algorithms is significantly better than previous algorithms that require quadratic time. A key idea in our embedding is the construction of an (exact) emulator for tree metrics with treewidth $O(\log\log n)$ and hop-diameter $O(\log \log n)$. This result may be of independent interest.

preprint2022arXiv

Make The Most of Prior Data: A Solution for Interactive Text Summarization with Preference Feedback

For summarization, human preference is critical to tame outputs of the summarizer in favor of human interests, as ground-truth summaries are scarce and ambiguous. Practical settings require dynamic exchanges between human and AI agent wherein feedback is provided in an online manner, a few at a time. In this paper, we introduce a new framework to train summarization models with preference feedback interactively. By properly leveraging offline data and a novel reward model, we improve the performance regarding ROUGE scores and sample-efficiency. Our experiments on three various datasets confirm the benefit of the proposed framework in active, few-shot and online settings of preference learning.

preprint2022arXiv

Multimodal Dialogue State Tracking

Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component in a dialogue system. However, the research of dialogue state tracking has largely been limited to unimodality, in which slots and slot values are limited by knowledge domains (e.g. restaurant domain with slots of restaurant name and price range) and are defined by specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems.

preprint2022arXiv

VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with the additional visual temporal variance and language cross-turn dependencies. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark.

preprint2021arXiv

Robust Deep Reinforcement Learning for Extractive Legal Summarization

Automatic summarization of legal texts is an important and still a challenging task since legal documents are often long and complicated with unusual structures and styles. Recent advances of deep models trained end-to-end with differentiable losses can well-summarize natural text, yet when applied to legal domain, they show limited results. In this paper, we propose to use reinforcement learning to train current deep summarization models to improve their performance on the legal domain. To this end, we adopt proximal policy optimization methods and introduce novel reward functions that encourage the generation of candidate summaries satisfying both lexical and semantic criteria. We apply our method to training different summarization backbones and observe a consistent and significant performance gain across 3 public legal datasets.

preprint2021arXiv

Sparse Euclidean Spanners with Tiny Diameter: A Tight Lower Bound

In STOC'95 [ADMSS'95] Arya et al. showed that any set of $n$ points in $\mathbb R^d$ admits a $(1+ε)$-spanner with hop-diameter at most 2 (respectively, 3) and $O(n \log n)$ edges (resp., $O(n \log \log n)$ edges). They also gave a general upper bound tradeoff of hop-diameter at most $k$ and $O(n α_k(n))$ edges, for any $k \ge 2$. The function $α_k$ is the inverse of a certain Ackermann-style function at the $\lfloor k/2 \rfloor$th level of the primitive recursive hierarchy, where $α_0(n) = \lceil n/2 \rceil$, $α_1(n) = \left\lceil \sqrt{n} \right\rceil$, $α_2(n) = \lceil \log{n} \rceil$, $α_3(n) = \lceil \log\log{n} \rceil$, $α_4(n) = \log^* n$, $α_5(n) = \lfloor \frac{1}{2} \log^*n \rfloor$, \ldots. Roughly speaking, for $k \ge 2$ the function $α_{k}$ is close to $\lfloor \frac{k-2}{2} \rfloor$-iterated log-star function, i.e., $\log$ with $\lfloor \frac{k-2}{2} \rfloor$ stars. Also, $α_{2α(n)+4}(n) \le 4$, where $α(n)$ is the one-parameter inverse Ackermann function, which is an extremely slowly growing function. Whether or not this tradeoff is tight has remained open, even for the cases $k = 2$ and $k = 3$. Two lower bounds are known: The first applies only to spanners with stretch 1 and the second is sub-optimal and applies only to sufficiently large (constant) values of $k$. In this paper we prove a tight lower bound for any constant $k$: For any fixed $ε> 0$, any $(1+ε)$-spanner for the uniform line metric with hop-diameter at most $k$ must have at least $Ω(n α_k(n))$ edges.

preprint2020arXiv

Improving Long Handwritten Text Line Recognition with Convolutional Multi-way Associative Memory

Convolutional Recurrent Neural Networks (CRNNs) excel at scene text recognition. Unfortunately, they are likely to suffer from vanishing/exploding gradient problems when processing long text images, which are commonly found in scanned documents. This poses a major challenge to goal of completely solving Optical Character Recognition (OCR) problem. Inspired by recently proposed memory-augmented neural networks (MANNs) for long-term sequential modeling, we present a new architecture dubbed Convolutional Multi-way Associative Memory (CMAM) to tackle the limitation of current CRNNs. By leveraging recent memory accessing mechanisms in MANNs, our architecture demonstrates superior performance against other CRNN counterparts in three real-world long text OCR datasets.

preprint2020arXiv

Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge

Audio-Visual Scene-Aware Dialog (AVSD) is an extension from Video Question Answering (QA) whereby the dialogue agent is required to generate natural language responses to address user queries and carry on conversations. This is a challenging task as it consists of video features of multiple modalities, including text, visual, and audio features. The agent also needs to learn semantic dependencies among user utterances and system responses to make coherent conversations with humans. In this work, we describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge. We adopt dot-product attention to combine text and non-text features of input video. We further enhance the generation capability of the dialogue agent by adopting pointer networks to point to tokens from multiple source sequences in each generation step. Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation among all submissions.

preprint2020arXiv

Non-Autoregressive Dialog State Tracking

Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself. These approaches have shown good performance gain, especially in complicated dialogue domains with dynamic slot values. However, they fall short in two aspects: (1) they do not allow models to explicitly learn signals across domains and slots to detect potential dependencies among (domain, slot) pairs; and (2) existing models follow auto-regressive approaches which incur high time cost when the dialogue evolves over multiple domains and multiple turns. In this paper, we propose a novel framework of Non-Autoregressive Dialog State Tracking (NADST) which can factor in potential dependencies among domains and slots to optimize the models towards better prediction of dialogue states as a complete set rather than separate slots. In particular, the non-autoregressive nature of our method not only enables decoding in parallel to significantly reduce the latency of DST for real-time dialogue response generation, but also detect dependencies among slots at token level in addition to slot and domain level. Our empirical results show that our model achieves the state-of-the-art joint accuracy across all domains on the MultiWOZ 2.1 corpus, and the latency of our model is an order of magnitude lower than the previous state of the art as the dialogue history extends over time.

preprint2020arXiv

On Light Spanners, Low-treewidth Embeddings and Efficient Traversing in Minor-free Graphs

Understanding the structure of minor-free metrics, namely shortest path metrics obtained over a weighted graph excluding a fixed minor, has been an important research direction since the fundamental work of Robertson and Seymour. A fundamental idea that helps both to understand the structural properties of these metrics and lead to strong algorithmic results is to construct a "small-complexity" graph that approximately preserves distances between pairs of points of the metric. We show the two following structural results for minor-free metrics: 1. Construction of a light subset spanner. Given a subset of vertices called terminals, and $ε$, in polynomial time we construct a subgraph that preserves all pairwise distances between terminals up to a multiplicative $1+ε$ factor, of total weight at most $O_ε(1)$ times the weight of the minimal Steiner tree spanning the terminals. 2. Construction of a stochastic metric embedding into low treewidth graphs with expected additive distortion $εD$. Namely, given a minor free graph $G=(V,E,w)$ of diameter $D$, and parameter $ε$, we construct a distribution $\mathcal{D}$ over dominating metric embeddings into treewidth-$O_ε(\log n)$ graphs such that the additive distortion is at most $εD$. One of our important technical contributions is a novel framework that allows us to reduce \emph{both problems} to problems on simpler graphs of bounded diameter. Our results have the following algorithmic consequences: (1) the first efficient approximation scheme for subset TSP in minor-free metrics; (2) the first approximation scheme for vehicle routing with bounded capacity in minor-free metrics; (3) the first efficient approximation scheme for vehicle routing with bounded capacity on bounded genus metrics.

preprint2020arXiv

Self-Attentive Associative Memory

Heretofore, neural networks with external memory are restricted to single memory with lossy representations of memory interactions. A rich representation of relationships between memory pieces urges a high-order and segregated relational memory. In this paper, we propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory). The idea is implemented through a novel Self-attentive Associative Memory (SAM) operator. Found upon outer product, SAM forms a set of associative memories that represent the hypothetical high-order relationships between arbitrary pairs of memory elements, through which a relational memory is constructed from an item memory. The two memories are wired into a single sequential model capable of both memorization and relational reasoning. We achieve competitive results with our proposed two-memory model in a diversity of machine learning tasks, from challenging synthetic problems to practical testbeds such as geometry, graph, reinforcement learning, and question answering.

preprint2020arXiv

Video-Grounded Dialogues with Pretrained Generation Language Models

Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models for improving video-grounded dialogue, which is very challenging and involves complex features of different dynamics: (1) Video features which can extend across both spatial and temporal dimensions; and (2) Dialogue features which involve semantic dependencies over multiple dialogue turns. We propose a framework by extending GPT-2 models to tackle these challenges by formulating video-grounded dialogue tasks as a sequence-to-sequence task, combining both visual and textual representation into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: spatio-temporal level in video and token-sentence level in dialogue context. We achieve promising improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.

preprint2019arXiv

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We get state of the art performance on Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance. We implemented our models using PyTorch and the code is released at https://github.com/henryhungle/MTN.