Source author record

Quanzeng You

Quanzeng You appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Computer Vision, Social and Information Networks, Information Retrieval, Artificial Intelligence, Machine Learning, Multimedia

Catalog footprint: 10 works, 6 topics, 4 close collaborators

Preprint, 2022, arXiv

Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention

Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher quality object segmentation masks, they can fail to associate instances in challenging cases because they do not explicitly model the temporal instance consistency for adjacent frames. We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context. Our extensive experiments demonstrate that the Inter-Frame Recurrent Attention significantly improves temporal instance consistency while maintaining the quality of the object segmentation masks. Our model achieves state-of-the-art accuracy on both YouTubeVIS-2019 (62.1%) and YouTubeVIS-2021 (54.7%) datasets. In addition, quantitative and qualitative results show that the proposed methods predict more temporally consistent instance segmentation masks.

Preprint, 2022, arXiv

Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation

Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. These works assume that the target domain data are accessible all at once. For real-world streaming data, however, this assumption hinders timely adaptation to changing data statistics and full exploitation of the growing number of samples. In this paper, to address more practical scenarios, we propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID. This is challenging because it requires the model to continuously adapt to unlabeled data in the target environments while alleviating catastrophic forgetting for such a fine-grained person retrieval task. We design an effective scheme for this task, dubbed CLUDA-ReID, where the anti-forgetting is harmoniously coordinated with the adaptation. Specifically, a meta-based Coordinated Data Replay strategy is proposed to replay old data and update the network with a coordinated optimization direction for both adaptation and memorization. Moreover, we propose Relational Consistency Learning for old knowledge distillation/inheritance in line with the objective of retrieval-based tasks. We set up two evaluation settings to simulate the practical application scenarios. Extensive experiments demonstrate the effectiveness of our CLUDA-ReID for both scenarios with stationary target streams and scenarios with dynamic target streams.

Preprint, 2022, arXiv

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, or the interactions across question words and object features. These attempts aim to improve the cross-modality representations, while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representations of visual and textual content, aiming to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve this issue by first converting different modality entities into sequential nodes and the adjacency graph, then incorporating them for structured alignments. As demonstrated in our experimental results, such a structured alignment improves reasoning performance. In addition, our model also exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on the GQA dataset, and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.

Preprint, 2020, arXiv

Real-time 3D Deep Multi-Camera Tracking

Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for the offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time multi-camera people tracking. Our DMCT consists of 1) a fast and novel perspective-aware Deep GroundPoint Network, 2) a fusion procedure for ground-plane occupancy heatmap estimation, 3) a novel Deep Glimpse Network for person detection and 4) a fast and accurate online tracker. Our design fully unleashes the power of deep neural networks to estimate the "ground point" of each person in each color image, which can be optimized to run efficiently and robustly. Our fusion procedure, glimpse network and tracker merge the results from different views, find people candidates using multiple video frames and then track people on the fused heatmap. Our system achieves state-of-the-art tracking results while maintaining real-time performance. Apart from evaluation on the challenging WILDTRACK dataset, we also collect two more tracking datasets with high-quality labels from two different environments and camera settings. Our experimental results confirm that our proposed real-time pipeline gives superior results to previous approaches.

Preprint, 2016, arXiv

Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark

Psychological research results have confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reaction towards images. To this end, different kinds of hand-tuned features are proposed. The results reported on several carefully selected and labeled small image data sets have confirmed the promise of such features. While the recent successes of many computer vision related tasks are due to the adoption of Convolutional Neural Networks (CNNs), visual emotion analysis has not achieved the same level of success. This may be primarily due to the unavailability of confidently labeled and relatively large image data sets for visual emotion analysis. In this work, we introduce a new data set, which started from 3+ million weakly labeled images of different emotions and ended up 30 times as large as the current largest publicly available visual emotion data set. We hope that this data set encourages further research on visual emotion analysis. We also perform extensive benchmarking analyses on this large data set using state-of-the-art methods, including CNNs.

Preprint, 2016, arXiv

Image Captioning with Semantic Attention

Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
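The idea of attending to semantic concept proposals and fusing them into a recurrent hidden state can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `softmax` and `attend`, the dot-product scoring, and the additive fusion are simplifications and are not taken from the paper, which learns its scoring and fusion parameters.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden, concepts):
    # Score each concept embedding against the hidden state (dot product; an assumption).
    scores = [sum(h * c for h, c in zip(hidden, vec)) for vec in concepts]
    weights = softmax(scores)
    # Attention-weighted sum of the concept embeddings.
    context = [
        sum(w * vec[i] for w, vec in zip(weights, concepts))
        for i in range(len(hidden))
    ]
    # Fuse the attended context into the hidden state (simple addition; an assumption).
    fused = [h + c for h, c in zip(hidden, context)]
    return weights, fused

# Toy example: two hypothetical concept embeddings in a 2-d space.
hidden = [1.0, 0.0]
concepts = [[1.0, 0.0], [0.0, 1.0]]
weights, fused = attend(hidden, concepts)
```

In this toy case the first concept is aligned with the hidden state, so it receives the larger attention weight and dominates the fused vector.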

Preprint, 2016, arXiv

The Effect of Pets on Happiness: A Data-Driven Approach via Large-Scale Social Media

Psychologists have demonstrated that pets have a positive impact on owners' happiness. For example, lonely people are often advised to have a dog or cat to quell their social isolation. Conventional psychological research methods of analyzing this phenomenon are mostly based on surveys or self-reported questionnaires, which are time-consuming and lack scalability. Utilizing social media as an alternative and complementary resource could potentially address both issues and provide different perspectives on this psychological investigation. In this paper, we propose a novel and effective approach that exploits social media to study the effect of pets on owners' happiness. The proposed framework includes three major components: 1) collecting user-level data from Instagram consisting of about 300,000 images from 2905 users; 2) constructing a convolutional neural network (CNN) for pet classification and, combined with timeline information, further identifying pet owners and the control group; 3) measuring the confidence score of happiness by detecting and analyzing selfie images. Furthermore, various demographic factors are employed to analyze the fine-grained effects of pets on happiness. Our experimental results demonstrate the effectiveness of the proposed approach, and we believe that this approach can be applied to other related domains as a large-scale, high-confidence methodology of user activity analysis through social media.

Preprint, 2016, arXiv

Voting with Feet: Who are Leaving Hillary Clinton and Donald Trump?

From a crowded field with 17 candidates, Hillary Clinton and Donald Trump have emerged as the two front-runners in the 2016 U.S. presidential campaign. The two candidates each boast more than 5 million followers on Twitter, and at the same time both have witnessed hundreds of thousands of people leave their camps. In this paper we attempt to characterize individuals who have left Hillary Clinton and Donald Trump between September 2015 and March 2016. Our study focuses on three dimensions of social demographics: social capital, gender, and age. Within each camp, we compare the characteristics of the current followers with former followers, i.e., individuals who have left since September 2015. We use the number of followers to measure social capital, and profile images to infer gender and age. For classifying gender, we train a convolutional neural network (CNN). For age, we use the Face++ API. Our study shows that, for both candidates, followers with more social capital are more likely to leave (or switch camps). For both candidates, females make up a larger presence among unfollowers than among current followers. Somewhat surprisingly, the effect is particularly pronounced for Clinton. Lastly, middle-aged individuals are more likely to leave Trump, and the young are more likely to leave Hillary Clinton.

Preprint, 2015, arXiv

A Picture Tells a Thousand Words -- About You! User Interest Profiling from User Generated Visual Content

Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most of the existing works have focused on textual content generated by the users and have successfully used it for predicting users' interests and other identifying attributes. However, little attention has been paid to user generated visual content (images) that is becoming increasingly popular and pervasive in recent times. We posit that images posted by users on online social networks are a reflection of topics they are interested in and propose an approach to infer user attributes from images posted by them. We analyze the content of individual images and then aggregate the image-level knowledge to infer user-level interest distribution. We employ image-level similarity to propagate the label information between images, as well as utilize the image category information derived from the user created organization structure to further propagate the category-level knowledge for all images. A real life social network dataset created from Pinterest is used for evaluation and the experimental results demonstrate the effectiveness of our proposed approach.
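The step of aggregating image-level knowledge into a user-level interest distribution can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `user_interest`, the category names, and the plain sum-and-normalize aggregation are hypothetical, and the similarity-based label propagation described in the abstract is not modeled here.

```python
def user_interest(image_probs):
    # image_probs: one dict per image, mapping category name -> probability.
    # Sum the per-image probabilities for each category.
    totals = {}
    for probs in image_probs:
        for category, p in probs.items():
            totals[category] = totals.get(category, 0.0) + p
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("no image-level probabilities to aggregate")
    # Normalize so the user-level distribution sums to 1.
    return {category: value / norm for category, value in totals.items()}

# Toy example: three images posted by one hypothetical user.
images = [
    {"travel": 0.8, "food": 0.2},
    {"travel": 0.6, "food": 0.4},
    {"food": 1.0},
]
profile = user_interest(images)
```

Here `profile` assigns slightly more mass to "food" than to "travel", because the third image contributes only to "food".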

Preprint, 2015, arXiv

Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks

Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis of such large scale visual content can help better extract user sentiments toward events or topics, such as those in image tweets, so that prediction of sentiment from visual content is complementary to textual sentiment analysis. Motivated by the need to leverage large scale yet noisy training data to solve the extremely challenging problem of image sentiment analysis, we employ Convolutional Neural Networks (CNN). We first design a suitable CNN architecture for image sentiment analysis. We obtain half a million training samples by using a baseline sentiment algorithm to label Flickr images. To make use of such noisy machine labeled data, we employ a progressive strategy to fine-tune the deep network. Furthermore, we improve the performance on Twitter images by inducing domain transfer with a small number of manually labeled Twitter images. We have conducted extensive experiments on manually labeled Twitter images. The results show that the proposed CNN can achieve better performance in image sentiment analysis than competing algorithms.
