Source author record

Chung-Ching Lin

Chung-Ching Lin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision, eess.SP, eess.SY, Systems and Control

Catalog footprint

What is connected

7 works

4 topics

4 close collaborators

preprint 2022 arXiv

Cross-modal Representation Learning for Zero-shot Action Recognition

We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be discriminative and more semantically consistent. In zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to composite unseen visual prototypes. Accordingly, the discriminative features in the visual structure could be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic gap, and the hubness problem. Under a rigorous zero-shot setting of not pre-training on additional datasets, experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
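The abstract does not spell out the semantic transfer scheme, so the following is only a minimal sketch of one plausible reading: an unseen class's visual prototype is composed as a relatedness-weighted sum of seen-class prototypes. The cosine similarity, the softmax weighting, the `temperature` value, and all function names and toy vectors are assumptions made for illustration, not the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def composite_prototype(unseen_sem, seen_sems, seen_protos, temperature=0.1):
    """Compose an unseen-class visual prototype from seen-class prototypes.

    Hypothetical scheme: weights come from the semantic similarity between
    the unseen label embedding and each seen label embedding.
    """
    weights = softmax([cosine(unseen_sem, s) for s in seen_sems], temperature)
    dim = len(seen_protos[0])
    proto = [sum(w * p[i] for w, p in zip(weights, seen_protos)) for i in range(dim)]
    return proto, weights

# Toy data (assumed): two seen classes with 2-D text embeddings
# and 3-D visual prototypes.
seen_sems = [[1.0, 0.0], [0.0, 1.0]]
seen_protos = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

# An unseen class whose text embedding matches the first seen class
# should inherit a prototype close to that class's visual prototype.
proto, weights = composite_prototype([1.0, 0.0], seen_sems, seen_protos)
```

At inference, a video embedding would then be assigned to the unseen class whose composed prototype is nearest; that step is omitted here.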

preprint 2022 arXiv

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate the advantage of LAVENDER over existing VidL methods in: (i) supporting all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/LAVENDER.

preprint 2022 arXiv

Multi-Mode Spatial Signal Processor with Rainbow-like Fast Beam Training and Wideband Communications using True-Time-Delay Arrays

Initial access in millimeter-wave (mmW) wireless is critical toward successful realization of the fifth-generation (5G) wireless networks and beyond. Limited bandwidth in existing standards and use of phase-shifters in analog/hybrid phased-antenna arrays (PAA) are not suited for these emerging standards demanding low-latency direction finding. This work proposes a reconfigurable true-time-delay (TTD) based spatial signal processor (SSP) with frequency-division beam training methodology and wideband beam-squint-less data communications. Discrete-time delay compensated clocking technique is used to support 800 MHz bandwidth with a large unity-gain bandwidth ring-amplifier (RAMP)-based signal combiner. To extensively characterize the proposed SSP across different SSP modes and frequency-angle pairs, an automated testbed is developed using computer-vision techniques that significantly speeds up testing and minimizes possible human errors. Using seven levels of time-interleaving for each of the 4 antenna elements, the TTD SSP has a delay range of 3.8 ns over 800 MHz and achieves unique frequency-to-angle mapping in the beamtraining mode with nearly 12 dB frequency-independent gain in the beamforming mode. The SSP is prototyped in 65 nm CMOS with an area of 1.98 mm$^2$ consuming only 29 mW excluding buffers. Further, an error vector magnitude (EVM) of 9.8% is realized for 16-QAM modulation at a speed of 122.8 Mb/s.
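As a quick reading aid for the last sentence of the abstract, the short sketch below converts the reported figures into two related quantities. It assumes that 122.8 Mb/s is the raw (uncoded) bit rate and that EVM is quoted as an RMS percentage; neither assumption is stated in the abstract, and the variable names are illustrative.

```python
import math

BIT_RATE_BPS = 122.8e6      # reported data rate
CONSTELLATION_SIZE = 16     # 16-QAM
EVM_PERCENT = 9.8           # reported error vector magnitude

# Each 16-QAM symbol carries log2(16) = 4 bits.
bits_per_symbol = int(math.log2(CONSTELLATION_SIZE))

# Implied symbol rate, assuming the bit rate is uncoded.
symbol_rate_baud = BIT_RATE_BPS / bits_per_symbol

# EVM expressed in decibels: 20 * log10(EVM as a fraction).
evm_db = 20.0 * math.log10(EVM_PERCENT / 100.0)

print(f"{bits_per_symbol} bits/symbol, "
      f"{symbol_rate_baud / 1e6:.1f} MBd, "
      f"EVM {evm_db:.1f} dB")
```

Under these assumptions the link runs at roughly 30.7 megasymbols per second, and the 9.8% EVM corresponds to about -20.2 dB.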

preprint 2022 arXiv

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. The learned sparse attention masks additionally push results to a new state of the art, and can be transferred between different video lengths and between different datasets. Code is available at https://github.com/microsoft/SwinBERT

preprint 2021 arXiv

AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

Temporal modelling is the key for efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AdaFuse/

preprint 2020 arXiv

AR-Net: Adaptive Frame Resolution for Efficient Action Recognition

Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on several challenging action recognition benchmark datasets well demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AR-Net

preprint 2020 arXiv

True-Time-Delay Arrays for Fast Beam Training in Wideband Millimeter-Wave Systems

The best beam steering directions are estimated through beam training, which is one of the most important and challenging tasks in millimeter-wave and sub-terahertz communications. Novel array architectures and signal processing techniques are required to avoid prohibitive beam training overhead associated with large antenna arrays and narrow beams. In this work, we leverage recent developments in true-time-delay (TTD) arrays with large delay-bandwidth products to accelerate beam training using frequency-dependent probing beams. We propose and study two TTD architecture candidates, including analog and hybrid analog-digital arrays, that can facilitate beam training with only one wideband pilot. We also propose a suitable algorithm that requires a single pilot to achieve high-accuracy estimation of angle of arrival. The proposed array architectures are compared in terms of beam training requirements and performance, robustness to practical hardware impairments, and power consumption. The findings suggest that the analog and hybrid TTD arrays achieve a sub-degree beam alignment precision with 66% and 25% lower power consumption than a fully digital array, respectively. Our results yield important design trade-offs among the basic system parameters, power consumption, and accuracy of angle of arrival estimation in fast TTD beam training.
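The idea of frequency-dependent probing beams can be illustrated with a small numerical sketch. It assumes a textbook uniform linear array with half-wavelength spacing at a 28 GHz carrier, 8 elements, and a fixed delay increment of 1.25 ns between adjacent elements; these parameters, the sign convention, and the function names are all assumptions chosen for illustration and are not taken from the paper.

```python
import cmath
import math

SPEED_OF_LIGHT = 3.0e8  # m/s

def array_gain(freq_hz, theta_deg, n_elem, spacing_m, delay_step_s):
    """Normalized gain of a TTD linear array at one frequency and angle.

    Element n applies a true time delay of n * delay_step_s; a plane wave
    from angle theta arrives with propagation delay n * d * sin(theta) / c.
    """
    theta = math.radians(theta_deg)
    phase_step = 2.0 * math.pi * freq_hz * (
        spacing_m * math.sin(theta) / SPEED_OF_LIGHT - delay_step_s
    )
    total = sum(cmath.exp(1j * n * phase_step) for n in range(n_elem))
    return abs(total) / n_elem

def peak_angle_deg(freq_hz, n_elem, spacing_m, delay_step_s):
    """Grid-search the angle (0.1 degree steps) where the gain peaks."""
    grid = [i / 10.0 for i in range(-900, 901)]
    return max(
        grid,
        key=lambda a: array_gain(freq_hz, a, n_elem, spacing_m, delay_step_s),
    )

# Assumed parameters (illustrative only).
CARRIER_HZ = 28.0e9
SPACING_M = SPEED_OF_LIGHT / (2.0 * CARRIER_HZ)  # half wavelength at carrier
DELAY_STEP_S = 1.25e-9                           # 1 / (800 MHz)
N_ELEM = 8

# Two subcarriers of the same wideband pilot point in different directions.
angle_center = peak_angle_deg(28.0e9, N_ELEM, SPACING_M, DELAY_STEP_S)
angle_offset = peak_angle_deg(28.2e9, N_ELEM, SPACING_M, DELAY_STEP_S)
```

Because the delay is fixed in time rather than in phase, each subcarrier sees a different effective phase slope across the array, so one wideband pilot probes many directions at once. In this toy setup the 28.0 GHz subcarrier peaks at broadside while the 28.2 GHz subcarrier peaks near 30 degrees.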
