Researcher profile

Zhicheng Huang

Zhicheng Huang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2023arXiv

CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition

Contrastive Masked Autoencoder (CMAE), as a new self-supervised framework, has shown its potential of learning expressive feature representations in visual image recognition. This work shows that CMAE also trivially generalizes well on video action recognition without modifying the architecture and the loss criterion. By directly replacing the original pixel shift with the temporal shift, our CMAE for visual action recognition, CMAE-V for short, can generate stronger feature representations than its counterpart based on pure masked autoencoders. Notably, CMAE-V, with a hybrid architecture, can achieve 82.2% and 71.6% top-1 accuracy on the Kinetics-400 and Something-something V2 datasets, respectively. We hope this report could provide some informative inspiration for future works.

preprint2022arXiv

WebRobot: Web Robotic Process Automation using Interactive Programming-by-Demonstration

It is imperative to democratize robotic process automation (RPA), as RPA has become a main driver of the digital transformation but is still technically very demanding to construct, especially for non-experts. In this paper, we study how to automate an important class of RPA tasks, dubbed web RPA, which are concerned with constructing software bots that automate interactions across data and a web browser. Our main contributions are twofold. First, we develop a formal foundation which allows semantically reasoning about web RPA programs and formulate its synthesis problem in a principled manner. Second, we propose a web RPA program synthesis algorithm based on a new idea called speculative rewriting. This leads to a novel speculate-and-validate methodology in the context of rewrite-based program synthesis, which has also shown to be both theoretically simple and practically efficient for synthesizing programs from demonstrations. We have built these ideas in a new interactive synthesizer called WebRobot and evaluate it on 76 web RPA benchmarks. Our results show that WebRobot automated a majority of them effectively. Furthermore, we show that WebRobot compares favorably with a conventional rewrite-based synthesis baseline implemented using egg. Finally, we conduct a small user study demonstrating WebRobot is also usable.

preprint2020arXiv

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs instead of using region-based image features as the most recent vision and language tasks. Our Pixel-BERT which aligns semantic connection in pixel and text level solves the limitation of task-specific visual representation for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the unbalance between semantic labels in visual task and language semantic. To provide a better representation for down-stream tasks, we pre-train a universal end-to-end model with image and sentence pairs from Visual Genome dataset and MS-COCO dataset. We propose to use a random pixel sampling mechanism to enhance the robustness of visual representation and to apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach makes the most state-of-the-arts in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, Natural Language for Visual Reasoning for Real (NLVR). Particularly, we boost the performance of a single model in VQA task by 2.17 points compared with SOTA under fair comparison.