Source author record

Lu Hou

Lu Hou appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Computation and Language Networking and Internet Architecture Artificial Intelligence

Catalog footprint

What is connected

5works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy leads to the fragmentation of original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, adversely affecting performance in cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process any size of high-resolution input without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compresses the vision tokens based on themselves, preserving the original context and positional information while reducing training overhead. To assess the ability of handling context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related and position-related tasks. Our comprehensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and on EntityGrid-QA, particularly on document-oriented tasks, establishing new standards for handling high-resolution inputs.

preprint2022arXiv

Compression of Generative Pre-trained Language Models via Quantization

The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the \textit{homogeneous word embeddings} caused by reduced capacity, and \textit{varied distribution of weights}. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With comparable performance with the full-precision models, we achieve 14.4x and 13.4x compression rates on GPT-2 and BART, respectively.

preprint2022arXiv

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) with a tremendous amount of image-text pair data, has shown its superiority on various multimodal alignment tasks. Despite its success, the resulting models are not capable of multimodal generative tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is pretty data- and computation-efficient compared to the pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model with $7\times$ fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.

preprint2015arXiv

Reliable and Efficient Autonomous Driving: the Need for Heterogeneous Vehicular Networks

Autonomous driving technology has been regarded as a promising solution to reduce road accidents and traffic congestion, as well as to optimize the usage of fuel and lane. Reliable and high efficient Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communications are essential to let commercial autonomous driving vehicles be on the road before 2020. The current paper firstly presents the concept of Heterogeneous Vehicular NETworks (HetVNETs) for autonomous driving, in which an improved protocol stack is proposed to satisfy the communication requirements of not only safety but also non-safety services. We then consider and study in detail several typical scenarios for autonomous driving. In order to tackle the potential challenges raised by the autonomous driving vehicles in HetVNETs, new techniques from transmission to networking are proposed as potential solutions.

preprint2015arXiv

Soft-Defined Heterogeneous Vehicular Network: Architecture and Challenges

Heterogeneous Vehicular NETworks (HetVNETs) can meet various quality-of-service (QoS) requirements for intelligent transport system (ITS) services by integrating different access networks coherently. However, the current network architecture for HetVNET cannot efficiently deal with the increasing demands of rapidly changing network landscape. Thanks to the centralization and flexibility of the cloud radio access network (Cloud-RAN), soft-defined networking (SDN) can conveniently be applied to support the dynamic nature of future HetVNET functions and various applications while reducing the operating costs. In this paper, we first propose the multi-layer Cloud RAN architecture for implementing the new network, where the multi-domain resources can be exploited as needed for vehicle users. Then, the high-level design of soft-defined HetVNET is presented in detail. Finally, we briefly discuss key challenges and solutions for this new network, corroborating its feasibility in the emerging fifth-generation (5G) era.

Lu Hou

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Compression of Generative Pre-trained Language Models via Quantization

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

Reliable and Efficient Autonomous Driving: the Need for Heterogeneous Vehicular Networks

Soft-Defined Heterogeneous Vehicular Network: Architecture and Challenges