Source author record

Jianfei Cai

Jianfei Cai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision, Machine Learning, Networking and Internet Architecture, eess.IV, Information Theory, math.IT, Artificial Intelligence, Graphics, Multimedia

Catalog footprint

What is connected

38 works

9 topics

4 close collaborators

preprint, 2023, arXiv

Class Enhancement Losses with Pseudo Labels for Zero-shot Semantic Segmentation

Recent mask proposal models have significantly improved the performance of zero-shot semantic segmentation. However, the use of a 'background' embedding during training in these methods is problematic, as the resulting model tends to over-learn and assign all unseen classes to the background class instead of their correct labels. Furthermore, they ignore the semantic relationship of text embeddings, which arguably can be highly informative for zero-shot prediction, as seen classes may be closely related to unseen classes. To this end, this paper proposes novel class enhancement losses to bypass the use of the background embedding during training, and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking the similarity scores. To further capture the relationship between seen and unseen classes, we propose an effective pseudo label generation pipeline using a pretrained vision-language model. Extensive experiments on several benchmark datasets show that our method achieves overall the best performance for zero-shot semantic segmentation. Our method is flexible, and can also be applied to the challenging open-vocabulary semantic segmentation problem.

preprint, 2022, arXiv

Dual Adaptive Transformations for Weakly Supervised Point Cloud Segmentation

Weakly supervised point cloud segmentation, i.e. semantically segmenting a point cloud with only a few labeled points in the whole 3D scene, is highly desirable due to the heavy burden of collecting abundant dense annotations for model training. However, it remains challenging for existing methods to accurately segment 3D point clouds, since limited annotated data may lead to insufficient guidance for label propagation to unlabeled data. Considering that smoothness-based methods have achieved promising progress, in this paper we advocate applying the consistency constraint under various perturbations to effectively regularize unlabeled 3D points. Specifically, we propose a novel DAT (Dual Adaptive Transformations) model for weakly supervised point cloud segmentation, where the dual adaptive transformations are performed via an adversarial strategy at both point-level and region-level, aiming at enforcing the local and structural smoothness constraints on 3D point clouds. We evaluate our proposed DAT model with two popular backbones on the large-scale S3DIS and ScanNet-V2 datasets. Extensive experiments demonstrate that our model can effectively leverage the unlabeled 3D points and achieve significant performance gains on both datasets, setting new state-of-the-art performance for weakly supervised point cloud segmentation.

preprint, 2022, arXiv

Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation

Semi-supervised segmentation remains challenging in medical imaging since annotated medical data is often scarce and there are many blurred pixels near the adhesive edges or in the low-contrast regions. To address these issues, we advocate first constraining the consistency of pixels with and without strong perturbations to apply a sufficient smoothness constraint, and further encouraging class-level separation to exploit low-entropy regularization for model training. In particular, in this paper we propose the SS-Net for semi-supervised medical image segmentation tasks, exploring pixel-level smoothness and inter-class separation at the same time. The pixel-level smoothness forces the model to generate invariant results under adversarial perturbations. Meanwhile, the inter-class separation encourages individual class features to approach their corresponding high-quality prototypes, in order to make each class distribution compact and separate different classes. We evaluated our SS-Net against five recent methods on the public LA and ACDC datasets. Extensive experimental results under two semi-supervised settings demonstrate the superiority of our proposed SS-Net model, achieving new state-of-the-art (SOTA) performance on both datasets. The code is available at https://github.com/ycwu1997/SS-Net.

preprint, 2022, arXiv

FocusFormer: Focusing on What We Need via Architecture Sampler

Vision Transformers (ViTs) have underpinned the recent breakthroughs in computer vision. However, designing the architectures of ViTs is laborious and heavily relies on expert knowledge. To automate the design process and incorporate deployment flexibility, one-shot neural architecture search decouples the supernet training and architecture specialization for diverse deployment scenarios. To cope with an enormous number of sub-networks in the supernet, existing methods treat all architectures as equally important and randomly sample some of them in each update step during training. During architecture search, however, these methods focus on finding architectures on the Pareto frontier of performance and resource consumption, which creates a gap between training and deployment. In this paper, we devise a simple yet effective method, called FocusFormer, to bridge such a gap. To this end, we propose to learn an architecture sampler to assign higher sampling probabilities to those architectures on the Pareto frontier under different resource constraints during supernet training, making them sufficiently optimized and hence improving their performance. During specialization, we can directly use the well-trained architecture sampler to obtain accurate architectures satisfying the given resource constraint, which significantly improves the search efficiency. Extensive experiments on CIFAR-100 and ImageNet show that our FocusFormer is able to improve the performance of the searched architectures while significantly reducing the search cost. For example, on ImageNet, our FocusFormer-Ti with 1.4G FLOPs outperforms AutoFormer-Ti by 0.5% in terms of the Top-1 accuracy.

preprint, 2022, arXiv

GMFlow: Learning Optical Flow via Global Matching

Learning-based optical flow estimation has been dominated by the pipeline of cost volume with convolutions for flow regression, which is inherently limited to local correlations and thus struggles with the long-standing challenge of large displacements. To alleviate this, the state-of-the-art framework RAFT gradually improves its prediction quality by using a large number of iterative refinements, achieving remarkable performance but introducing linearly increasing inference time. To enable both high accuracy and efficiency, we completely revamp the dominant flow regression pipeline by reformulating optical flow as a global matching problem, which identifies the correspondences by directly comparing feature similarities. Specifically, we propose a GMFlow framework, which consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation. We further introduce a refinement step that reuses GMFlow at higher feature resolution for residual flow prediction. Our new framework outperforms RAFT with 31 refinements on the challenging Sintel benchmark, while using only one refinement and running faster, suggesting a new paradigm for accurate and efficient optical flow estimation. Code is available at https://github.com/haofeixu/gmflow.
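The "correlation and softmax layer for global feature matching" can be illustrated with a small standard-library sketch. It is not the released GMFlow code: the function name `global_matching_flow`, the scaling by the square root of the feature dimension, and the use of the expected matched coordinate as the correspondence are assumptions made for illustration, and the Transformer feature enhancement and flow propagation steps are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_matching_flow(feat1, feat2, coords):
    """Estimate flow by comparing every pixel of image 1 with every pixel of image 2.

    feat1, feat2: one feature vector (list of floats) per pixel.
    coords: the (x, y) position of each pixel, shared by both images.
    """
    dim = len(feat1[0])
    flow = []
    for f1, (x, y) in zip(feat1, coords):
        # Correlation with all positions in the second image (global, not local).
        scores = [sum(a * b for a, b in zip(f1, f2)) / math.sqrt(dim) for f2 in feat2]
        weights = softmax(scores)
        # Soft correspondence: expected coordinate under the matching distribution.
        mx = sum(w * cx for w, (cx, _) in zip(weights, coords))
        my = sum(w * cy for w, (_, cy) in zip(weights, coords))
        flow.append((mx - x, my - y))
    return flow


# Toy example: three pixels in a row whose features are cyclically shifted by one.
coords = [(0, 0), (1, 0), (2, 0)]
a, b, c = [20.0, 0.0, 0.0], [0.0, 20.0, 0.0], [0.0, 0.0, 20.0]
print(global_matching_flow([a, b, c], [c, a, b], coords))
```

In the toy example the first pixel's feature reappears one position to the right, so its estimated flow is close to (1.0, 0.0). Because every pair of positions is compared, the size of the displacement does not limit the match.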

preprint, 2022, arXiv

High-Quality Pluralistic Image Completion via Code Shared VQGAN

PICNet pioneered the generation of multiple and diverse results for the image completion task, but it required a careful balance between the KL loss (diversity) and the reconstruction loss (quality), resulting in limited diversity and quality. Separately, iGPT-based architectures have been employed to infer distributions in a discrete space derived from a pixel-level pre-clustered palette, which, however, cannot generate high-quality results directly. In this work, we present a novel framework for pluralistic image completion that can achieve both high quality and diversity at much faster inference speed. The core of our design lies in a simple yet effective code sharing mechanism that leads to a very compact yet expressive image representation in a discrete latent domain. The compactness and the richness of the representation further facilitate the subsequent deployment of a transformer to effectively learn how to composite and complete a masked image in the discrete code domain. Based on the global context well-captured by the transformer and the available visual regions, we are able to sample all tokens simultaneously, which is completely different from the prevailing autoregressive approach of iGPT-based works, and leads to more than 100× faster inference speed. Experiments show that our framework is able to learn semantically-rich discrete codes efficiently and robustly, resulting in much better image reconstruction quality. Our diverse image completion framework significantly outperforms the state-of-the-art both quantitatively and qualitatively on multiple benchmark datasets.

preprint, 2022, arXiv

Image Captioning In the Transformer Age

Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNN and RNN do not share the basic network component, such a heterogeneous pipeline is hard to train end-to-end, and the visual encoder will not learn anything from the caption supervision. This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training. The Transformer is well suited to this role: it has proven its huge potential in both vision and language domains and thus can be used as the basic component of the visual encoder and language decoder in an IC pipeline. Meanwhile, self-supervised learning releases the power of the Transformer architecture, so that a pre-trained large-scale model can be generalized to various tasks including IC. The success of these large-scale models seems to weaken the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC and some popular self-supervised learning paradigms. Due to the page limitation, we only refer to highly important papers in this short survey; more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.

preprint, 2022, arXiv

MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models

Deep learning methods usually require a large amount of training data and lack interpretability. In this paper, we propose a novel knowledge distillation and model interpretation framework for medical image classification that jointly solves the above two issues. Specifically, to address the data-hungry issue, a small student model is learned with less data by distilling knowledge from a cumbersome pretrained teacher model. To interpret the teacher model and assist the learning of the student, an explainer module is introduced to highlight the regions of an input that are important for the predictions of the teacher model. Furthermore, the joint framework is trained in a principled way derived from an information-theoretic perspective. Our framework outperforms state-of-the-art methods on the knowledge distillation and model interpretation tasks on a fundus dataset.

preprint, 2022, arXiv

Mesa: A Memory-saving Training Framework for Transformers

There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive owing to storing all intermediate activations that are needed for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving training framework for Transformers. Specifically, Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. The low-precision activations are then dequantized during back-propagation to compute gradients. Besides, to address the heterogeneous activation distributions in the multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn quantization parameters by running estimates. More importantly, by re-investing the saved memory in employing a larger batch size or scaling up model size, we may further improve the performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can achieve flexible memory-savings (up to 50%) during training while achieving comparable or even better performance. Code is available at https://github.com/ziplab/Mesa.
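The idea of storing low-precision activations and dequantizing them for the backward pass, with statistics kept separately per attention head, can be sketched as follows. This is an illustration under stated assumptions, not Mesa's implementation: the helper names `quantize_headwise` and `dequantize_headwise` are hypothetical, and a plain min-max range per head stands in for the quantization parameters that the abstract says are learned by running estimates.

```python
def quantize_headwise(activations, num_bits=8):
    """Quantize each head with its own range (assumption: min-max per head).

    activations: list of heads, each a list of floats.
    Returns one (codes, scale, offset) tuple per head; only these are stored.
    """
    levels = 2 ** num_bits - 1
    stored = []
    for head in activations:
        lo, hi = min(head), max(head)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((v - lo) / scale) for v in head]  # integers in [0, levels]
        stored.append((codes, scale, lo))
    return stored


def dequantize_headwise(stored):
    """Recover approximate activations, as would be needed for gradient computation."""
    return [[c * scale + lo for c in codes] for codes, scale, lo in stored]


# Two heads with very different value ranges.
heads = [[0.0, 0.3, 1.0], [-100.0, 7.0, 100.0]]
stored = quantize_headwise(heads)
restored = dequantize_headwise(stored)
for head, approx, (_, scale, _) in zip(heads, restored, stored):
    worst = max(abs(v - a) for v, a in zip(head, approx))
    print(f"scale={scale:.5f} worst_error={worst:.5f}")
```

Because each head gets its own scale, the narrow-range head is not forced onto the coarse grid that the wide-range head needs, which is the motivation the abstract gives for head-wise quantization.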

preprint, 2022, arXiv

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position, which relies on ongoing interactions with the environment while moving. Recent Transformer-based VLN methods have made great progress, benefiting from the direct connections between visual observations and the language instruction via the multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector by using an LSTM decoder or using manually designed hidden states to build a recurrent Transformer. Considering that a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce the Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation by modelling the temporal context explicitly. Specifically, MTVM enables the agent to keep track of the navigation trajectory by directly storing previous activations in a memory bank. To further boost the performance, we propose a memory-aware consistency loss to help learn a better joint representation of temporal context with random masked instructions. We evaluate MTVM on popular R2R and CVDN datasets, and our model improves Success Rate on R2R unseen validation and test set by 2% each, and reduce Goal Process by 1.6m on CVDN test set.

preprint, 2022, arXiv

Mutual Consistency Learning for Semi-supervised Medical Image Segmentation

In this paper, we propose a novel mutual consistency network (MC-Net+) to effectively exploit the unlabeled data for semi-supervised medical image segmentation. The MC-Net+ model is motivated by the observation that deep models trained with limited annotations are prone to output highly uncertain and easily mis-classified predictions in the ambiguous regions (e.g., adhesive edges or thin branches) for medical image segmentation. Leveraging these challenging samples can make the semi-supervised segmentation model training more effective. Therefore, our proposed MC-Net+ model consists of two new designs. First, the model contains one shared encoder and multiple slightly different decoders (i.e., using different up-sampling strategies). The statistical discrepancy of multiple decoders' outputs is computed to denote the model's uncertainty, which indicates the unlabeled hard regions. Second, we apply a novel mutual consistency constraint between one decoder's probability output and other decoders' soft pseudo labels. In this way, we minimize the discrepancy of multiple outputs (i.e., the model uncertainty) during training and force the model to generate invariant results in such challenging regions, aiming at regularizing the model training. We compared the segmentation results of our MC-Net+ model with five state-of-the-art semi-supervised approaches on three public medical datasets. Extensive experiments with two standard semi-supervised settings demonstrate the superior performance of our model over other methods, which sets a new state of the art for semi-supervised medical image segmentation. Our code is released publicly at https://github.com/ycwu1997/MC-Net.
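The mutual consistency constraint between one decoder's probability output and the other decoders' soft pseudo labels can be sketched as below. The abstract does not say how the soft pseudo labels are formed; temperature sharpening of a binary foreground probability and a mean squared error penalty are assumptions here, and the names `sharpen` and `mutual_consistency_loss` are hypothetical.

```python
def sharpen(p, temperature=0.1):
    """Push a foreground probability towards 0 or 1 (assumed soft pseudo label)."""
    a = p ** (1.0 / temperature)
    b = (1.0 - p) ** (1.0 / temperature)
    return a / (a + b)


def mutual_consistency_loss(decoder_probs, temperature=0.1):
    """Mean squared gap between each decoder and every other decoder's pseudo label.

    decoder_probs: one list of per-pixel foreground probabilities per decoder
    (at least two decoders, all of the same length).
    """
    total, count = 0.0, 0
    for i, probs_i in enumerate(decoder_probs):
        for j, probs_j in enumerate(decoder_probs):
            if i == j:
                continue  # a decoder is never supervised by its own output
            for p, q in zip(probs_i, probs_j):
                total += (p - sharpen(q, temperature)) ** 2
                count += 1
    return total / count


agree = [[1.0, 0.0], [1.0, 0.0]]      # confident and consistent decoders
disagree = [[0.9, 0.1], [0.1, 0.9]]   # decoders that contradict each other
print(mutual_consistency_loss(agree), mutual_consistency_loss(disagree))
```

The loss is zero when the decoders agree confidently and grows where their outputs diverge, which are the regions the abstract describes as uncertain.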

preprint, 2022, arXiv

Object-Compositional Neural Implicit Surfaces

The neural implicit representation has shown its effectiveness in novel view synthesis and high-quality 3D reconstruction from multi-view images. However, most approaches focus on holistic scene representation yet ignore individual objects inside it, thus limiting potential downstream applications. In order to learn object-compositional representation, a few works incorporate the 2D semantic map as a cue in training to grasp the difference between objects. But they neglect the strong connections between object geometry and instance semantic information, which leads to inaccurate modeling of individual instances. This paper proposes a novel framework, ObjectSDF, to build an object-compositional neural implicit representation with high fidelity in 3D reconstruction and object representation. Observing the ambiguity of conventional volume rendering pipelines, we model the scene by combining the Signed Distance Functions (SDF) of individual objects to exert an explicit surface constraint. The key in distinguishing different instances is to revisit the strong association between an individual object's SDF and semantic label. In particular, we convert the semantic information to a function of object SDF and develop a unified and compact representation for scene and objects. Experimental results show the superiority of the ObjectSDF framework in representing both the holistic object-compositional scene and the individual instances. Code can be found at https://qianyiwu.github.io/objectsdf/
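Combining per-object signed distance functions into a scene can be shown with analytic shapes. The abstract does not state the combination rule; the sketch below assumes the standard union of shapes, where the scene SDF at a point is the minimum over the object SDFs, and it uses spheres in place of learned networks. The names `sphere_sdf` and `scene_sdf` are hypothetical.

```python
import math


def sphere_sdf(center, radius):
    """Return the signed distance function of a sphere (negative inside)."""
    def sdf(point):
        return math.dist(point, center) - radius
    return sdf


def scene_sdf(object_sdfs, point):
    """Scene SDF as the minimum over object SDFs (assumed union rule).

    Returns the scene distance and the index of the object that attains it.
    """
    values = [sdf(point) for sdf in object_sdfs]
    distance = min(values)
    return distance, values.index(distance)


objects = [sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_sdf((3.0, 0.0, 0.0), 0.5)]
print(scene_sdf(objects, (3.0, 0.0, 0.0)))  # centre of the second sphere
print(scene_sdf(objects, (1.5, 0.0, 0.0)))  # empty space between the spheres
```

The index of the nearest object gives a crude picture of the link between an object's SDF and its instance label. The paper's actual mapping from SDF to semantics is not reproduced here.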

preprint, 2022, arXiv

ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues

Object proposal generation is an important and fundamental task in computer vision. In this paper, we propose ProposalCLIP, a method towards unsupervised open-category object proposal generation. Unlike previous works which require a large number of bounding box annotations and/or can only generate proposals for limited object categories, our ProposalCLIP is able to predict proposals for a large variety of object categories without annotations, by exploiting CLIP (contrastive language-image pre-training) cues. Firstly, we analyze CLIP for unsupervised open-category proposal generation and design an objectness score based on our empirical analysis on proposal selection. Secondly, a graph-based merging module is proposed to solve the limitations of CLIP cues and merge fragmented proposals. Finally, we present a proposal regression module that extracts pseudo labels based on CLIP cues and trains a lightweight network to further refine proposals. Extensive experiments on PASCAL VOC, COCO and Visual Genome datasets show that our ProposalCLIP can better generate proposals than previous state-of-the-art methods. Our ProposalCLIP also shows benefits for downstream tasks, such as unsupervised object detection.

preprint, 2022, arXiv

Rapid Elastic Architecture Search under Specialized Classes and Resource Constraints

In many real-world applications, we often need to handle various deployment scenarios, where the resource constraint and the superclass of interest corresponding to a group of classes are dynamically specified. How to efficiently deploy deep models for diverse deployment scenarios is a new challenge. Previous NAS approaches seek to design architectures for all classes simultaneously, which may not be optimal for some individual superclasses. A straightforward solution is to search an architecture from scratch for each deployment scenario, which however is computation-intensive and impractical. To address this, we present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse superclasses with various resource constraints. To this end, we first propose to effectively train an over-parameterized network via a superclass dropout strategy during training. In this way, the resulting model is robust to the subsequent superclasses dropping at inference time. Based on the well-trained over-parameterized network, we then propose an efficient architecture generator to obtain promising architectures within a single forward pass. Experiments on three image classification datasets show that EAS is able to find more compact networks with better performance while remarkably being orders of magnitude faster than state-of-the-art NAS methods, e.g., outperforming OFA (once-for-all) by 1.3% on Top-1 accuracy at a budget around 361M #MAdds on ImageNet-10. More critically, EAS is able to find compact architectures within 0.1 second for 50 deployment scenarios.

preprint, 2022, arXiv

Towards Unbiased Visual Emotion Recognition via Causal Intervention

Although much progress has been made in visual emotion recognition, researchers have realized that modern deep networks tend to exploit dataset characteristics to learn spurious statistical associations between the input and the target. Such dataset characteristics are usually treated as dataset bias, which damages the robustness and generalization performance of these recognition systems. In this work, we scrutinize this problem from the perspective of causal inference, where such dataset characteristic is termed as a confounder which misleads the system to learn the spurious correlation. To alleviate the negative effects brought by the dataset bias, we propose a novel Interventional Emotion Recognition Network (IERN) to achieve the backdoor adjustment, which is one fundamental deconfounding technique in causal inference. Specifically, IERN starts by disentangling the dataset-related context feature from the actual emotion feature, where the former forms the confounder. The emotion feature will then be forced to see each confounder stratum equally before being fed into the classifier. A series of designed tests validate the efficacy of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms state-of-the-art approaches for unbiased visual emotion recognition. Code is available at https://github.com/donydchen/causal_emotion
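For reference, the backdoor adjustment mentioned above has a standard form in causal inference. With input $X$, emotion label $Y$, and a confounder taking values $z$ (here, the dataset-related context), it reads as follows. How IERN approximates each term is not reproduced here.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X, z\bigr)\, P(z)
```

The sum weights each confounder stratum by its prior $P(z)$ rather than by $P(z \mid X)$. This is one way to read the abstract's statement that the emotion feature is forced to see each confounder stratum equally.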

preprint, 2022, arXiv

Transformer Scale Gate for Semantic Segmentation

Effectively encoding multi-scale contextual information is crucial for accurate semantic segmentation. Existing transformer-based segmentation models combine features across scales without any selection, where features on sub-optimal scales may degrade segmentation outcomes. Leveraging the inherent properties of Vision Transformers, we propose a simple yet effective module, Transformer Scale Gate (TSG), to optimally combine multi-scale features. TSG exploits cues in self and cross attentions in Vision Transformers for the scale selection. TSG is a highly flexible plug-and-play module, and can easily be incorporated with any encoder-decoder-based hierarchical vision Transformer architecture. Extensive experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.

preprint, 2022, arXiv

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation in conventional referring grounding, unpaired referring grounding is introduced, where the training data only contains a number of images and queries without correspondences. The few existing solutions to unpaired referring grounding are still preliminary, due to the challenges of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective via generating query-specific visual attention maps. A cross-modal object matching (COM) module is further introduced, which exploits the recently emerged image-text matching pretrained model, CLIP, to predict the target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.

preprint, 2021, arXiv

Causal Attention for Vision-Language Tasks

We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on the spurious correlations in training data, damaging the model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge on the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can promote the lighter LXMERT, which uses fewer data and less computational power, to be comparable to the heavier UNITER. Code is published at https://github.com/yangxuntu/catt.
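For reference, the front-door adjustment named in this abstract has the following standard form, where $X$ is the input, $Y$ the prediction, and $Z$ a mediator between them (the symbol names are chosen here for illustration). It requires no access to the unobserved confounder.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Z = z \mid X\bigr) \sum_{x'} P\bigl(X = x'\bigr)\, P\bigl(Y \mid Z = z,\, X = x'\bigr)
```

One plausible reading, offered as an interpretation rather than a statement of the paper's derivation, is that In-Sample Attention plays the role of selecting $z$ from the current input, while Cross-Sample Attention supplies the outer expectation over other inputs $x'$.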

preprint, 2020, arXiv

Disentangled Human Body Embedding Based on Deep Hierarchical Neural Network

Human bodies exhibit various shapes for different identities or poses, but the body shape has certain similarities in structure and thus can be embedded in a low-dimensional space. This paper presents an autoencoder-like network architecture to learn disentangled shape and pose embedding specifically for the 3D human body. This is inspired by recent progress in deformation-based latent representation learning. To improve the reconstruction accuracy, we propose a hierarchical reconstruction pipeline for the disentangling process and construct a large dataset of human body models with consistent connectivity for the learning of the neural network. Our learned embedding can not only achieve superior reconstruction accuracy but also provide great flexibility in 3D human body generation via interpolation, bilinear interpolation, and latent space sampling. The results from extensive experiments demonstrate the effectiveness of our learned 3D human body embedding in various applications.

preprint, 2020, arXiv

Expert Training: Task Hardness Aware Meta-Learning for Few-Shot Classification

Deep neural networks are highly effective when a large number of labeled samples are available but fail with few-shot classification tasks. Recently, meta-learning methods have received much attention, which train a meta-learner on massive additional tasks to gain the knowledge to instruct the few-shot classification. Usually, the training tasks are randomly sampled and performed indiscriminately, often making the meta-learner stuck into a bad local optimum. Some works in the optimization of deep neural networks have shown that a better arrangement of training data can make the classifier converge faster and perform better. Inspired by this idea, we propose an easy-to-hard expert meta-training strategy to arrange the training tasks properly, where easy tasks are preferred in the first phase, then, hard tasks are emphasized in the second phase. A task hardness aware module is designed and integrated into the training procedure to estimate the hardness of a task based on the distinguishability of its categories. In addition, we explore multiple hardness measurements including the semantic relation, the pairwise Euclidean distance, the Hausdorff distance, and the Hilbert-Schmidt independence criterion. Experimental results on the miniImageNet and tieredImageNetSketch datasets show that the meta-learners can obtain better results with our expert training strategy.

preprint, 2020, arXiv

Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection

Fully supervised object detection has achieved great success in recent years. However, abundant bounding box annotations are needed for training a detector for novel classes. To reduce the human labeling effort, we propose a novel webly supervised object detection (WebSOD) method for novel classes which only requires web images without further annotations. Our proposed method combines bottom-up and top-down cues for novel class detection. Within our approach, we introduce a bottom-up mechanism based on a well-trained fully supervised object detector (i.e. Faster RCNN) as an object region estimator for web images by recognizing the common objectiveness shared by base and novel classes. With the estimated regions on the web images, we then utilize the top-down attention cues as the guidance for region classification. Furthermore, we propose a residual feature refinement (RFR) block to tackle the domain mismatch between the web domain and the target domain. We demonstrate our proposed method on the PASCAL VOC dataset with three different novel/base splits. Without any target-domain novel-class images and annotations, our proposed webly supervised object detection model is able to achieve promising performance for novel classes. Moreover, we also conduct transfer learning experiments on the large-scale ILSVRC 2013 detection dataset and achieve state-of-the-art performance.

preprint, 2020, arXiv

Image Co-skeletonization via Co-segmentation

Recent advances in the joint processing of images have certainly shown its advantages over individual processing. Different from the existing works geared towards co-segmentation or co-localization, in this paper, we explore a new joint processing topic: image co-skeletonization, which is defined as joint skeleton extraction of objects in an image collection. Object skeletonization in a single natural image is a challenging problem because there is hardly any prior knowledge about the object. Therefore, we resort to the idea of object co-skeletonization, hoping that the commonness prior that exists across the images may help, just as it does for other joint processing problems such as co-segmentation. We observe that the skeleton can provide good scribbles for segmentation, and skeletonization, in turn, needs good segmentation. Therefore, we propose a coupled framework for co-skeletonization and co-segmentation tasks so that they are well informed by each other, and benefit each other synergistically. Since it is a new problem, we also construct a benchmark dataset by annotating nearly 1.8k images spread across 38 categories. Extensive experiments demonstrate that the proposed method achieves promising results in all the three possible scenarios of joint-processing: weakly-supervised, supervised, and unsupervised.

preprint2020arXiv

Large-scale Heteroscedastic Regression via Gaussian Process

Heteroscedastic regression, which accounts for varying noise among observations, has many applications in fields such as machine learning and statistics. Here we focus on heteroscedastic Gaussian process (HGP) regression, which integrates the latent function and the noise function in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from cubic time complexity, which strictly limits its application to big data. To improve the scalability, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, two variants are developed to improve the scalability and capability of VSHGP. The first is stochastic VSHGP (SVSHGP), which derives a factorized evidence lower bound, thus enhancing efficient stochastic variational inference. The second is distributed VSHGP (DVSHGP), which (i) follows the Bayesian committee machine formalism to distribute computations over multiple local VSHGP experts with many inducing points; and (ii) adopts hybrid parameters for experts to guard against over-fitting and capture local variety. The superiority of DVSHGP and SVSHGP as compared to existing scalable heteroscedastic/homoscedastic GPs is then extensively verified on various datasets.

preprint2020arXiv

Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation

Despite the huge progress in scene graph generation in recent years, the long-tail distribution of object relationships remains a challenging and pestering issue. Existing methods largely rely on either external knowledge or statistical bias information to alleviate this problem. In this paper, we tackle this issue from two other aspects: (1) scene-object interaction, which aims at learning specific knowledge from a scene via an additive attention mechanism; and (2) long-tail knowledge transfer, which tries to transfer the rich knowledge learned from the head into the tail. Extensive experiments on three tasks on the benchmark dataset Visual Genome demonstrate that our method outperforms current state-of-the-art competitors.

preprint2020arXiv

Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture

The problem of deforming an artist-drawn caricature according to a given normal face expression is of interest in applications such as social media, animation and entertainment. This paper presents a solution to the problem, with an emphasis on enhancing the ability to create desired expressions while preserving the identity exaggeration style of the caricature, which is challenging due to the complicated nature of caricatures. The key to our solution is a novel method to model caricature expression, which extends the traditional 3DMM representation to the caricature domain. The method consists of shape modelling and texture generation for caricatures. Geometric optimization is developed to create identity-preserving blendshapes for reconstructing accurate and stable geometric shape, and a conditional generative adversarial network (cGAN) is designed for generating dynamic textures under target expressions. Combining the shape and texture components allows the non-trivial expressions of a caricature to be effectively defined by the extension of the popular 3DMM representation, so a caricature can be flexibly deformed into arbitrary expressions with visually good results in both shape and color spaces. The experiments demonstrate the effectiveness of the proposed method.

preprint2016arXiv

Exploit Bounding Box Annotations for Multi-label Object Recognition

Convolutional neural networks (CNNs) have shown great performance as general feature representations for object recognition applications. However, for multi-label images that contain multiple objects from different categories, scales and locations, global CNN features are not optimal. In this paper, we incorporate local information to enhance the discriminative power of the features. In particular, we first extract object proposals from each image. With each image treated as a bag and the object proposals extracted from it treated as instances, we transform the multi-label recognition problem into a multi-class multi-instance learning problem. Then, in addition to extracting the typical CNN feature representation from each proposal, we propose to make use of ground-truth bounding box annotations (strong labels) to add another level of local information by using nearest-neighbor relationships of local regions to form a multi-view pipeline. The proposed multi-view multi-instance framework utilizes both weak and strong labels effectively, and more importantly it has the generalization ability to boost performance even on unseen categories by using partial strong labels from other categories. Our framework is extensively compared with state-of-the-art hand-crafted feature based methods and CNN based methods on two multi-label benchmark datasets. The experimental results validate the discriminative power and the generalization ability of the proposed framework. With strong labels, our framework is able to achieve state-of-the-art results on both datasets.

preprint2016arXiv

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations

Multi-label learning has attracted significant interest in computer vision recently, finding applications in many vision tasks such as multiple object recognition and automatic image annotation. Associating multiple labels with a complex image is very difficult, not only due to the intricacy of describing the image, but also because of the incomplete nature of the observed labels. Existing works on the problem either ignore the label-label and instance-instance correlations or just assume these correlations are linear and unstructured. Considering that semantic correlations between images are actually structured, in this paper we propose to incorporate structured semantic correlations to solve the missing label problem of multi-label learning. Specifically, we project images to the semantic space with an effective semantic descriptor. A semantic graph is then constructed on these images to capture the structured correlations between them. We utilize the semantic graph Laplacian as a smoothness term in the multi-label learning formulation to incorporate the structured semantic correlations. Experimental results demonstrate the effectiveness of the proposed semantic descriptor and the usefulness of incorporating the structured semantic correlations. We achieve better results than state-of-the-art multi-label learning methods on four benchmark datasets.
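As a rough illustration of how a graph Laplacian can act as a smoothness term, the following minimal sketch computes the quadratic form f^T L f for label scores f on a small weighted graph. The unnormalized Laplacian L = D - W, the function names and the toy weights are assumptions made for illustration; the abstract does not specify the exact formulation used in the paper.

```python
def laplacian(weights):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(weights)
    return [
        [(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(scores, weights):
    """Quadratic form f^T L f, equal to 0.5 * sum_ij W_ij * (f_i - f_j)^2."""
    lap = laplacian(weights)
    n = len(scores)
    return sum(scores[i] * lap[i][j] * scores[j] for i in range(n) for j in range(n))


# Hypothetical 3-image semantic graph: image 0 ~ image 1 (weight 1.0),
# image 1 ~ image 2 (weight 0.5), no self-loops.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]

smooth = smoothness([1.0, 1.0, 1.0], W)  # identical scores on all images
rough = smoothness([1.0, 0.0, 1.0], W)   # neighbouring images disagree
```

The penalty is zero when connected images receive the same score and grows with the edge weight whenever neighbours disagree, which is why adding it to a learning objective encourages predictions that vary smoothly over the graph.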

preprint2015arXiv

Beyond Pixels: A Comprehensive Survey from Bottom-up to Semantic Image Segmentation and Cosegmentation

Image segmentation refers to the process of dividing an image into nonoverlapping meaningful regions according to human perception, and it has been a classic topic since the early days of computer vision. A lot of research has been conducted and has resulted in many applications. However, while many segmentation algorithms exist, only a few sparse and outdated summaries are available, and an overview of the recent achievements and issues is lacking. We aim to provide a comprehensive review of the recent progress in this field. Covering 180 publications, we give an overview of broad areas of segmentation topics including not only the classic bottom-up approaches, but also recent developments in superpixels, interactive methods, object proposals, semantic image parsing and image cosegmentation. In addition, we review the existing influential datasets and evaluation metrics. Finally, we suggest some design flavors and research directions for future research in image segmentation.

preprint2015arXiv

Diagnosing State-Of-The-Art Object Proposal Methods

Object proposal has become a popular paradigm to replace exhaustive sliding window search in current top-performing methods in PASCAL VOC and ImageNet. Recently, Hosang et al. conducted the first unified study of existing methods in terms of various image-level degradations. On the other hand, the vital question "what object-level characteristics really affect existing methods' performance?" has not yet been answered. Inspired by Hoiem et al.'s work in categorical object detection, this paper conducts the first meta-analysis of the impact of various object-level characteristics on state-of-the-art object proposal methods. Specifically, we examine the effects of object size, aspect ratio, iconic view, color contrast, shape regularity and texture. We also analyse existing methods' localization accuracy and latency for various PASCAL VOC object classes. Our study reveals the limitations of existing methods in terms of non-iconic view, small object size, low color contrast, shape regularity, etc. Based on our observations, lessons are also learned and shared with respect to the selection of existing object proposal technologies as well as the design of future ones.

preprint2015arXiv

Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes

We introduce a novel robust hybrid 3D face tracking framework for RGBD video streams, which is capable of tracking head pose and facial actions without pre-calibration or intervention from a user. In particular, we emphasize improving the tracking performance in instances where the tracked subject is at a large distance from the cameras and the quality of the point cloud deteriorates severely. This is accomplished by the combination of a flexible 3D shape regressor and joint 2D+3D optimization of the shape parameters. Our approach fits facial blendshapes to the point cloud of the human head, while being driven by an efficient and rapid 3D shape regressor trained on generic RGB datasets. As an on-line tracking system, it adapts the identity of the unknown user on the fly, resulting in improved 3D model reconstruction and consequently better tracking performance. The result is a robust RGBD face tracker, capable of handling a wide range of target scene depths, beyond those that can be afforded by traditional depth or RGB face trackers. Lastly, since the blendshape is not able to accurately recover the real facial shape, we use the tracked 3D face model as a prior in a novel filtering process to further refine the depth map for use in other tasks, such as 3D reconstruction.

preprint2015arXiv

Weakly Supervised Fine-Grained Image Categorization

In this paper, we categorize fine-grained images without using any object / part annotation in either the training or the testing stage, a step towards making fine-grained categorization suitable for deployment. Fine-grained image categorization aims to classify objects with subtle distinctions. Most existing works rely heavily on object / part detectors to build the correspondence between object parts by using object or object part annotations inside training images. The need for expensive object annotations prevents the wide usage of these methods. Instead, we propose to select useful parts from multi-scale part proposals in objects, and use them to compute a global image representation for categorization. This is specially designed for the annotation-free fine-grained categorization task, because useful parts have been shown to play an important role in existing annotation-dependent works but accurate part detectors can hardly be acquired. With the proposed image representation, we can further detect and visualize the key (most discriminative) parts in objects of different classes. In the experiments, the proposed annotation-free method achieves better accuracy than state-of-the-art annotation-free methods and most existing annotation-dependent methods on two challenging datasets, which shows that it is not always necessary to use accurate object / part annotations in fine-grained image categorization.

preprint2014arXiv

Maximum Multipath Routing Throughput in Multirate Wireless Mesh Networks

In this paper, we consider the problem of finding the maximum routing throughput between any pair of nodes in an arbitrary multirate wireless mesh network (WMN) using multiple paths. Multipath routing is an efficient technique to maximize routing throughput in WMNs; however, maximizing multipath routing throughput is an NP-complete problem due to the shared medium for electromagnetic wave transmission in the wireless channel, which makes collision-free scheduling part of the optimization problem. In this work, we first provide a problem formulation that incorporates a collision-free schedule, and then, based on this formulation, we design an algorithm with search pruning that jointly optimizes paths and the transmission schedule. Although the algorithm is suboptimal, we demonstrate that, compared to the known optimal single-path flow, an efficient multipath routing scheme can increase the routing throughput by up to 100% for simple WMNs.

preprint2013arXiv

Primer and Recent Developments on Fountain Codes

In this paper we survey the various erasure codes which have been proposed and patented recently, and along the way we provide an introductory tutorial on many of the essential concepts and readings in erasure and Fountain codes. Packet erasure is a fundamental characteristic inherent in data storage and data transmission systems. Traditionally, replication/retransmission based techniques have been employed to deal with packet erasures in such systems. While Reed-Solomon (RS) erasure codes have been known for quite some time to improve system reliability and reduce data redundancy, the high decoding computation cost of RS codes has hindered their wider implementation. However, recent exponential growth in data traffic and demand for larger data storage capacity has revived interest in erasure codes. Recent work has shown promising results in addressing the tradeoff between decoding computation complexity and redundancy inherent in erasure codes.

preprint2012arXiv

Optimal Solution for the Index Coding Problem Using Network Coding over GF(2)

The index coding problem is a fundamental transmission problem which occurs in a wide range of multicast networks. Network coding over a large finite field has been shown to be a theoretically efficient solution to the index coding problem. However, the high computational complexity of packet encoding and decoding over a large finite field, with its resulting penalty on encoding and decoding throughput and its higher energy cost, makes it unsuitable for practical implementation in processor- and energy-constrained devices like mobile phones and wireless sensors. While network coding over GF(2) can alleviate these concerns, it comes at the cost of degraded throughput performance. To address this tradeoff, we propose a throughput-optimal triangular network coding scheme over GF(2). We show that such a coding scheme can supply an unlimited number of innovative packets and that decoding involves only simple back substitution. Such a coding scheme provides an efficient solution to the index coding problem, and its lower computation and energy cost makes it suitable for practical implementation on devices with limited processing and energy capacity.
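To show why a triangular system over GF(2) can be decoded with back substitution and nothing more than XOR, here is a minimal sketch. It is not the triangular coding scheme of the paper, whose construction the abstract does not describe; the upper-triangular coefficient matrix, the packet values and the function names are all assumptions chosen for illustration.

```python
def encode(packets, coeffs):
    """XOR together the packets selected by each 0/1 coefficient row (GF(2) sum)."""
    coded = []
    for row in coeffs:
        acc = 0
        for bit, pkt in zip(row, packets):
            if bit:
                acc ^= pkt
        coded.append(acc)
    return coded


def back_substitute(coded, coeffs):
    """Solve an upper-triangular GF(2) system with a unit diagonal, last row first."""
    n = len(coded)
    decoded = [0] * n
    for i in range(n - 1, -1, -1):
        acc = coded[i]
        for j in range(i + 1, n):
            if coeffs[i][j]:
                acc ^= decoded[j]  # remove packets that are already known
        decoded[i] = acc
    return decoded


# Hypothetical 4-bit packets stored as integers.
packets = [0b1011, 0b0110, 0b1110]
# Assumed upper-triangular coefficient matrix with ones on the diagonal.
coeffs = [
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]

coded = encode(packets, coeffs)
recovered = back_substitute(coded, coeffs)
```

Because the last coded packet contains a single unknown, each step upward reveals exactly one more packet, so decoding needs no Gaussian elimination and no arithmetic beyond XOR.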

preprint2011arXiv

Cooperative Retransmissions Through Collisions

Interference in wireless networks is one of the key capacity-limiting factors. Recently developed interference-embracing techniques show promising performance on turning collisions into useful transmissions. However, the interference-embracing techniques are hard to apply in practical applications due to their strict requirements. In this paper, we consider utilising the interference-embracing techniques in a common scenario of two interfering sender-receiver pairs. By employing opportunistic listening and analog network coding (ANC), we show that compared to traditional ARQ retransmission, a higher retransmission throughput can be achieved by allowing two interfering senders to cooperatively retransmit selected lost packets at the same time. This simultaneous retransmission is facilitated by a simple handshaking procedure without introducing additional overhead. Simulation results demonstrate the superior performance of the proposed cooperative retransmission.

preprint2011arXiv

Joint Network Coding for Interfering Wireless Multicast Networks

Interference in wireless networks is one of the key capacity-limiting factors. The multicast capacity of an ad hoc wireless network decreases with an increasing number of transmitting and/or receiving nodes within a fixed area. Digital Network Coding (DNC) has been shown to improve the multicast capacity of non-interfering wireless networks. However, the recently proposed Physical-layer Network Coding (PNC) and Analog Network Coding (ANC) have shown that it is possible to decode an unknown packet from the collision of two packets when one of the colliding packets is known a priori. Taking advantage of such a collision decoding scheme, in this paper we propose a Joint Network Coding based Cooperative Retransmission (JNC-CR) scheme, where we show that ANC along with DNC can offer a much higher retransmission gain than that attainable through ANC, DNC or Automatic Repeat reQuest (ARQ) based retransmission alone. This scheme can be applied to two wireless multicast groups interfering with each other. Because of the broadcast nature of wireless transmission, receivers of different multicast groups can opportunistically listen to and cache packets from the interfering transmitter. These cached packets, along with the packets the receiver receives from its own transmitter, can then be used for decoding the JNC packet. We validate the higher retransmission gain of JNC against an optimal DNC scheme using simulation.

preprint2010arXiv

An Efficient Network Coding based Retransmission Algorithm for Wireless Multicasts

Retransmission based on packet acknowledgement (ACK/NAK) is a fundamental error control technique employed in IEEE 802.11-2007 unicast networks. However, the 802.11-2007 standard falls short of proposing a reliable MAC-level recovery protocol for multicast frames. In this paper we propose a latency- and bandwidth-efficient coding algorithm based on the principles of network coding for retransmitting lost packets in a single-hop wireless multicast network and demonstrate its effectiveness over previously proposed network coding based retransmission algorithms.
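The general principle behind network coding based retransmission can be sketched with two receivers that each lost a different packet: one XOR-coded retransmission repairs both losses. This is only the textbook idea, not the algorithm proposed in the paper; the packet contents, equal packet lengths and the helper name `xor_bytes` are assumptions for illustration.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical multicast round: the sender broadcast p1 and p2.
p1 = b"AAAA"
p2 = b"BBBB"

# Receiver 1 lost p1 but holds p2; receiver 2 lost p2 but holds p1.
# A single coded retransmission serves both receivers.
coded = xor_bytes(p1, p2)

recovered_by_r1 = xor_bytes(coded, p2)  # receiver 1 cancels the packet it has
recovered_by_r2 = xor_bytes(coded, p1)  # receiver 2 does the same with p1
```

Plain retransmission would need two transmissions here, one per lost packet, so the coded packet halves the retransmission cost in this toy case.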

preprint2010arXiv

Collision Codes: Decoding Superimposed BPSK Modulated Wireless Transmissions

The introduction of physical layer network coding gives rise to the concept of turning a collision of transmissions on a wireless channel into something useful. In physical layer network coding, two synchronized simultaneous packet transmissions are carefully encoded such that the superimposed transmission can be decoded to produce a packet which is identical to the bitwise binary sum of the two transmitted packets. This paper explores the decoding of superimposed transmissions resulting from multiple synchronized simultaneous transmissions. We devise a coding scheme that achieves the identification of individual transmissions from the synchronized superimposed transmission. A mathematical proof of the existence of such a coding scheme is given.
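The mapping from a superimposed BPSK signal to the bitwise binary sum can be sketched as follows, assuming an idealised channel with perfect symbol synchronization, unit gains and no noise. The bit-to-symbol convention (0 maps to +1, 1 maps to -1), the threshold and the function names are assumptions for illustration; the sketch covers only the two-sender XOR mapping, not the paper's scheme for identifying individual transmissions.

```python
def bpsk(bits):
    """Map bits to BPSK symbols: 0 -> +1.0, 1 -> -1.0 (assumed convention)."""
    return [1.0 - 2.0 * b for b in bits]


def superimpose(a, b):
    """Idealised collision: the symbols add sample by sample."""
    return [x + y for x, y in zip(a, b)]


def decode_xor(received, threshold=1.0):
    """Amplitude near 0 means the bits differed (XOR = 1); near +/-2 means equal."""
    return [1 if abs(y) < threshold else 0 for y in received]


bits_a = [0, 1, 1, 0]
bits_b = [0, 0, 1, 1]

received = superimpose(bpsk(bits_a), bpsk(bits_b))
xor_bits = decode_xor(received)
```

The receiver never learns the two packets individually from this mapping; it recovers only their XOR, which is enough when one of the packets is already known.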

38 published item(s)

Class Enhancement Losses with Pseudo Labels for Zero-shot Semantic Segmentation

Dual Adaptive Transformations for Weakly Supervised Point Cloud Segmentation

Exploring Smoothness and Class-Separation for Semi-supervised Medical Image Segmentation

FocusFormer: Focusing on What We Need via Architecture Sampler

GMFlow: Learning Optical Flow via Global Matching

High-Quality Pluralistic Image Completion via Code Shared VQGAN

Image Captioning In the Transformer Age

MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models

Mesa: A Memory-saving Training Framework for Transformers

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Mutual Consistency Learning for Semi-supervised Medical Image Segmentation

Object-Compositional Neural Implicit Surfaces

ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues

Rapid Elastic Architecture Search under Specialized Classes and Resource Constraints

Towards Unbiased Visual Emotion Recognition via Causal Intervention

Transformer Scale Gate for Semantic Segmentation

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Causal Attention for Vision-Language Tasks

Disentangled Human Body Embedding Based on Deep Hierarchical Neural Network

Expert Training: Task Hardness Aware Meta-Learning for Few-Shot Classification

Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection

Image Co-skeletonization via Co-segmentation

Large-scale Heteroscedastic Regression via Gaussian Process

Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation

Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture

Exploit Bounding Box Annotations for Multi-label Object Recognition

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations

Beyond Pixels: A Comprehensive Survey from Bottom-up to Semantic Image Segmentation and Cosegmentation

Diagnosing State-Of-The-Art Object Proposal Methods

Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes

Weakly Supervised Fine-Grained Image Categorization

Maximum Multipath Routing Throughput in Multirate Wireless Mesh Networks

Primer and Recent Developments on Fountain Codes

Optimal Solution for the Index Coding Problem Using Network Coding over GF(2)

Cooperative Retransmissions Through Collisions

Joint Network Coding for Interfering Wireless Multicast Networks

An Efficient Network Coding based Retransmission Algorithm for Wireless Multicasts

Collision Codes: Decoding Superimposed BPSK Modulated Wireless Transmissions