Source author record

Jialin Wu

Jialin Wu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Computation and Language Artificial Intelligence cs.CY

Catalog footprint

What is connected

7works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Possibilities and Implications of the Multi-AI Competition

The possibility of super-AIs taking over the world has been intensively studied by numerous scholars. This paper focuses on the multi-AI competition scenario under the premise of super-AIs in power. Firstly, the article points out the defects of existing arguments supporting single-AI domination and presents arguments in favour of multi-AI competition. Then the article concludes that the multi-AI competition situation is a non-negligible possibility. Attention then turns to whether multi-AI competition is better for the overall good of humanity than a situation where a single AI is in power. After analysing the best, worst, and intermediate scenarios, the article concludes that multi-AI competition is better for humanity. Finally, considering the factors related to the formation of the best-case scenario of multiple AIs, the article gives some suggestions for current initiatives in AI development.

preprint2021arXiv

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concept; such as the relationships between various objects. The limited use of object categories combined with their relationships or simple question embedding is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose the use of text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be incorporated with visual features and question embedding to obtain the question-relevant answer. A joint-embedding multi-head attention network is also proposed to model three different information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and reveal that it outperformed all of the competing methods in terms of both quantitative and qualitative results.

preprint2020arXiv

Generating Question Relevant Captions to Aid Visual Question Answering

Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrates that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.

preprint2020arXiv

Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder

Most RNN-based image captioning models receive supervision on the output words to mimic human captions. Therefore, the hidden states can only receive noisy gradient signals via layers of back-propagation through time, leading to less accurate generated captions. Consequently, we propose a novel framework, Hidden State Guidance (HSG), that matches the hidden states in the caption decoder to those in a teacher decoder trained on an easier task of autoencoding the captions conditioned on the image. During training with the REINFORCE algorithm, the conventional rewards are sentence-based evaluation metrics equally distributed to each generated word, no matter their relevance. HSG provides a word-level reward that helps the model learn better hidden representations. Experimental results demonstrate that HSG clearly outperforms various state-of-the-art caption decoders using either raw images or detected objects as inputs.

preprint2020arXiv

Improving VQA and its Explanations \\ by Comparing Competing Explanations

Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer distribution given the question and visual content. As a result, these systems frequently take shortcuts, focusing on simple visual concepts or question priors. This phenomenon becomes more problematic as the questions become complex that requires more reasoning and commonsense knowledge. To address this issue, we present a novel framework that uses explanations for competing answers to help VQA systems select the correct answer. By training on human textual explanations, our framework builds better representations for the questions and visual content, and then reweights confidences in the answer candidates using either generated or retrieved explanations from the training set. We evaluate our framework on the VQA-X dataset, which has more difficult questions with human explanations, achieving new state-of-the-art results on both VQA and its explanations.

preprint2019arXiv

Self-Critical Reasoning for Robust Visual Question Answering

Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state-of-the-art i.e., 49.5% using textual explanations and 48.5% using automatically annotated regions.

preprint2016arXiv

Action Recognition with Joint Attention on Multi-Level Deep Features

We propose a novel deep supervised neural network for the task of action recognition in videos, which implicitly takes advantage of visual tracking and shares the robustness of both deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). In our method, a multi-branch model is proposed to suppress noise from background jitters. Specifically, we firstly extract multi-level deep features from deep CNNs and feed them into 3d-convolutional network. After that we feed those feature cubes into our novel joint LSTM module to predict labels and to generate attention regularization. We evaluate our model on two challenging datasets: UCF101 and HMDB51. The results show that our model achieves the state-of-art by only using convolutional features.

Jialin Wu

What is connected

Connect this record

See the researcher in context

Building this map preview

7 published item(s)

Possibilities and Implications of the Multi-AI Competition

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Generating Question Relevant Captions to Aid Visual Question Answering

Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder

Improving VQA and its Explanations \\ by Comparing Competing Explanations

Self-Critical Reasoning for Robust Visual Question Answering

Action Recognition with Joint Attention on Multi-Level Deep Features