Researcher profile

Shiming Xiang

Shiming Xiang contributes to research discovery and scholarly infrastructure.

Researcher; affiliation not imported; open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified; Verification L1; Unclaimed author
5 works
0 followers
7 topics
4 close collaborators

Published work

5 published items

Preprint, 2022, arXiv

Adversarial Gradient Driven Exploration for Deep Click-Through Rate Prediction

Exploration-Exploitation (E&E) algorithms are commonly adopted to deal with the feedback-loop issue in large-scale online recommender systems. Most existing studies assume that high uncertainty is a good indicator of potential reward, and thus focus primarily on estimating model uncertainty. We argue that such an approach overlooks the subsequent effect of exploration on model training. From the perspective of online learning, adopting an exploration strategy also affects the collection of training data, which in turn influences model learning. To understand the interaction between exploration and training, we design a Pseudo-Exploration module that simulates the model updating process after a certain item is explored and the corresponding feedback is received. We further show that such a process is equivalent to adding an adversarial perturbation to the model input, and therefore name our proposed approach Adversarial Gradient Driven Exploration (AGE). For production deployment, we propose a dynamic gating unit to pre-determine the utility of an exploration. This enables us to use the limited resources available for exploration and avoid wasting pageviews on ineffective exploration. The effectiveness of AGE was first examined through extensive ablation studies on an academic dataset. AGE has also been deployed on one of the world-leading display advertising platforms, where we observe significant improvements on various top-line evaluation metrics.
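
The link the abstract above draws between exploration and an input perturbation can be illustrated with a toy model. The sketch below is not the authors' AGE implementation: it assumes a plain logistic click model, and the names `predict`, `input_gradient`, `perturb_input`, the weights, and `epsilon` are all hypothetical. It only shows the generic idea of nudging an input along the gradient of the predicted click probability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Predicted click probability of a toy logistic model (an assumption)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def input_gradient(weights, features):
    """Gradient of the predicted probability with respect to the input."""
    p = predict(weights, features)
    return [p * (1.0 - p) * w for w in weights]


def perturb_input(weights, features, epsilon=0.1):
    """Move the input a step of L2 length epsilon along the gradient."""
    grad = input_gradient(weights, features)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(features)
    return [x + epsilon * g / norm for x, g in zip(features, grad)]


# Hypothetical weights and item features, chosen only for illustration.
weights = [0.8, -0.5, 0.3]
features = [1.0, 2.0, 0.5]

base_score = predict(weights, features)
explored_score = predict(weights, perturb_input(weights, features))
```

In this toy setting the perturbed score is always at least as high as the base score, so the gap between the two acts as an optimistic bonus for the item. How AGE derives and bounds its perturbation is described in the paper, not here.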

Preprint, 2022, arXiv

Expanding Language-Image Pretrained Models for General Video Recognition

Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability on various image tasks. However, how to effectively extend such language-image pretraining methods to the video domain is still an open problem. In this work, we present a simple yet effective approach that adapts pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and generalizes to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs than Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms the previous best methods by +32.1% and +23.1% when labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP.
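
As a rough picture of what "exchanging information across frames" can mean, the sketch below applies plain scaled dot-product attention over one feature vector per frame. It is an assumption-laden illustration, not the X-CLIP module: there are no learned projections, heads, or message tokens, and `cross_frame_attention` and the example frame vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Let every frame attend to all frames (itself included).

    frame_tokens: list of per-frame feature vectors of equal length.
    Returns one mixed vector per frame, each a weighted average of
    all frame vectors.
    """
    dim = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in frame_tokens:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in frame_tokens
        ]
        attn = softmax(scores)
        mixed.append([
            sum(a * value[d] for a, value in zip(attn, frame_tokens))
            for d in range(dim)
        ])
    return mixed


# Three hypothetical frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_frames = cross_frame_attention(frames)
```

Because each output is a convex combination of the inputs, every frame's new representation carries some information from the other frames, which is the property the abstract highlights.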

Preprint, 2022, arXiv

Pro-tuning: Unified Prompt Tuning for Vision Tasks

In computer vision, fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks. However, deploying it in practice is quite challenging, because it relies on parameter-inefficient global updates and on high-quality downstream data. Recently, prompt-based learning, which adds a task-relevant prompt to adapt downstream tasks to pre-trained models, has drastically boosted the performance of many natural language downstream tasks. In this work, we extend the transfer ability afforded by prompts to vision models as an alternative to fine-tuning. To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images with the pre-trained model frozen. By training only a few additional parameters, it can work on diverse CNN-based and Transformer-based architectures. Extensive experiments show that Pro-tuning outperforms fine-tuning in a broad range of vision tasks and scenarios, including image classification (generic objects, class imbalance, image corruption, adversarial robustness, and out-of-distribution generalization) and dense prediction tasks such as object detection and semantic segmentation.
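
The idea of training only a small prompt while the pre-trained model stays frozen can be shown on a deliberately tiny example. The sketch below is not Pro-tuning itself: it assumes a frozen logistic classifier and a single additive prompt vector on the input, and `FROZEN_WEIGHTS`, `tune_prompt`, the data, and the learning rate are all hypothetical.

```python
import math

# Stand-in for a pre-trained model; these weights are never updated.
FROZEN_WEIGHTS = (0.5, -1.0, 0.25)


def predict(features, prompt):
    """Frozen classifier applied to the input plus the learned prompt."""
    logit = sum(
        w * (x + p) for w, x, p in zip(FROZEN_WEIGHTS, features, prompt)
    )
    return 1.0 / (1.0 + math.exp(-logit))


def loss(data, prompt):
    """Mean binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for features, label in data:
        p = predict(features, prompt)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(data)


def tune_prompt(data, steps=200, lr=0.5):
    """Gradient descent on the prompt only; the model weights stay fixed."""
    prompt = [0.0] * len(FROZEN_WEIGHTS)
    for _ in range(steps):
        grad = [0.0] * len(prompt)
        for features, label in data:
            err = predict(features, prompt) - label
            for i, w in enumerate(FROZEN_WEIGHTS):
                grad[i] += err * w / len(data)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


# Hypothetical downstream data on which the frozen model does poorly.
data = [((1.0, 2.0, 0.0), 1), ((0.0, 1.0, 1.0), 1), ((2.0, 1.0, 0.0), 0)]
zero_prompt = [0.0, 0.0, 0.0]
tuned_prompt = tune_prompt(data)
```

Only the three prompt values are trained here, which mirrors the parameter-efficiency argument in the abstract; the actual method learns richer prompts inside CNN and Transformer backbones.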

Preprint, 2020, arXiv

Learning Where to Focus for Efficient Video Object Detection

Transferring existing image-based detectors to video is non-trivial because frame quality is degraded by partial occlusion, rare poses, and motion blur. Previous approaches propagate and aggregate features across video frames using optical-flow warping. However, directly applying image-level optical flow to high-level features might not establish accurate spatial correspondences. Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) is proposed to learn semantic-level correspondences among adjacent frame features accurately. The sampled locations are first randomly initialized, then updated iteratively and progressively to find better spatial correspondences, guided by detection supervision. In addition, a Sparsely Recursive Feature Updating (SRFU) module and a Dense Feature Aggregation (DFA) module are introduced to model temporal relations and enhance per-frame features, respectively. Without bells and whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed. Code will be made available at https://github.com/jiangzhengkai/LSTS.

Preprint, 2020, arXiv

Precipitation Forecasting via Multi-Scale Deconstructed ConvLSTM

Numerical Weather Prediction (NWP) is widely used in precipitation forecasting. It is based on complex equations of atmospheric motion and requires supercomputers to infer the state of the atmosphere. Due to the complexity of the task and the heavy computation involved, this methodology is inefficient and uneconomical. With the rapid development of meteorological technology, the collection of plentiful numerical meteorological data offers opportunities to develop data-driven models for the NWP task. In this paper, we consider combining NWP with deep learning. First, to improve the spatiotemporal modeling of meteorological elements, a deconstruction mechanism and multi-scale filters are combined into a multi-scale deconstructed ConvLSTM (MSD-ConvLSTM). The MSD-ConvLSTM captures and fuses contextual information through multi-scale filters with low parameter cost. Furthermore, an encoder-decoder is constructed to encode the features of multiple meteorological elements with a deep CNN and decode the spatiotemporal information from different elements with the MSD-ConvLSTM. Our method demonstrates that the data-driven approach is significant for weather prediction, as confirmed by the experimental results of precipitation forecasting on the European Centre Weather Forecasts (EC) and China Meteorological Forecasts (CM) datasets.