Source author record

Tieniu Tan

Tieniu Tan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, Artificial Intelligence, Machine Learning, Information Retrieval, Social and Information Networks

Catalog footprint

What is connected

19 works

5 topics

4 close collaborators


preprint 2022 arXiv

Cross-Domain Cross-Set Few-Shot Learning via Learning Compact and Aligned Representations

Few-shot learning (FSL) aims to recognize novel queries with only a few support samples through leveraging prior knowledge from a base dataset. In this paper, we consider the domain shift problem in FSL and aim to address the domain gap between the support set and the query set. Different from previous cross-domain FSL work (CD-FSL) that considers the domain shift between base and novel classes, the new problem, termed cross-domain cross-set FSL (CDCS-FSL), requires few-shot learners not only to adapt to the new domain, but also to be consistent between different domains within each novel class. To this end, we propose a novel approach, namely stabPA, to learn prototypical compact and cross-domain aligned representations, so that the domain shift and few-shot learning can be addressed simultaneously. We evaluate our approach on two new CDCS-FSL benchmarks built from the DomainNet and Office-Home datasets respectively. Remarkably, our approach outperforms multiple elaborated baselines by a large margin, e.g., improving 5-shot accuracy by 6.0 points on average on DomainNet. Code is available at https://github.com/WentaoChen0813/CDCS-FSL

preprint 2022 arXiv

Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring

Attributes skew prevents current federated learning (FL) frameworks from keeping optimization directions consistent among the clients, which inevitably leads to performance reduction and unstable convergence. The core problems are: 1) Domain-specific attributes, which are non-causal and only locally valid, are inadvertently mixed into global aggregation. 2) The one-stage optimization of entangled attributes cannot simultaneously satisfy two conflicting objectives, i.e., generalization and personalization. To cope with these, we propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches, which are trained independently by the proposed alternating local-global optimization. Importantly, convergence analysis proves that the FL system converges stably even if incomplete client models participate in the global aggregation, which greatly expands the application scope of FL. Extensive experiments verify that DFL gives FL higher performance, better interpretability, and a faster convergence rate, compared with SOTA FL methods on both manually synthesized and realistic attributes skew datasets.

preprint 2022 arXiv

Focal and Efficient IOU Loss for Accurate Bounding Box Regression

In object detection, bounding box regression (BBR) is a crucial step that determines the object localization performance. However, we find that most previous loss functions for BBR have two main drawbacks: (i) Both $\ell_n$-norm and IOU-based loss functions describe the objective of BBR inefficiently, which leads to slow convergence and inaccurate regression results. (ii) Most of the loss functions ignore the imbalance problem in BBR, namely that the large number of anchor boxes which have small overlaps with the target boxes contribute most to the optimization of BBR. To mitigate the adverse effects caused thereby, we perform thorough studies to exploit the potential of BBR losses in this paper. Firstly, an Efficient Intersection over Union (EIOU) loss is proposed, which explicitly measures the discrepancies of three geometric factors in BBR, i.e., the overlap area, the central point and the side length. After that, we state the Effective Example Mining (EEM) problem and propose a regression version of focal loss to make the regression process focus on high-quality anchor boxes. Finally, the above two parts are combined to obtain a new loss function, namely Focal-EIOU loss. Extensive experiments on both synthetic and real datasets are performed. They show notable gains in both convergence speed and localization accuracy over other BBR losses.
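
The three geometric factors named in this abstract (overlap area, central point, side length) can be illustrated with a short sketch. The function below is one plausible reading of an EIOU-style loss for axis-aligned boxes, not the authors' code: it takes one minus IoU, adds the squared center distance normalized by the squared diagonal of the smallest enclosing box, and adds the squared width and height differences normalized by that enclosing box's squared width and height. The function name, the (x1, y1, x2, y2) box format, the normalization choices and the `eps` constant are all assumptions made for illustration, and the focal reweighting step is omitted.

```python
def eiou_loss(box, target, eps=1e-9):
    """EIOU-style loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch only: name, box format, normalization and eps
    are assumptions, not taken from the paper's implementation.
    """
    bx1, by1, bx2, by2 = box
    tx1, ty1, tx2, ty2 = target

    # Widths and heights of both boxes.
    bw, bh = bx2 - bx1, by2 - by1
    tw, th = tx2 - tx1, ty2 - ty1

    # Factor 1: overlap area, expressed through IoU.
    iw = max(0.0, min(bx2, tx2) - max(bx1, tx1))
    ih = max(0.0, min(by2, ty2) - max(by1, ty1))
    inter = iw * ih
    union = bw * bh + tw * th - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both inputs, used to normalize the penalties.
    cw = max(bx2, tx2) - min(bx1, tx1)
    ch = max(by2, ty2) - min(by1, ty1)

    # Factor 2: squared distance between the two central points.
    dx = (bx1 + bx2) / 2 - (tx1 + tx2) / 2
    dy = (by1 + by2) / 2 - (ty1 + ty2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Factor 3: squared differences in side lengths (width and height).
    side_term = ((bw - tw) ** 2 / (cw * cw + eps)
                 + (bh - th) ** 2 / (ch * ch + eps))

    return 1.0 - iou + center_term + side_term
```

Under these assumptions, two identical boxes give a loss close to zero, while disjoint boxes still receive a non-constant penalty through the center term, which is the property that lets such a loss keep providing a gradient when IoU alone is zero.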

preprint 2022 arXiv

Generalizable Person Re-Identification via Self-Supervised Batch Norm Test-Time Adaption

In this paper, we investigate the generalization problem of person re-identification (re-id), whose major challenge is the distribution shift on an unseen domain. As an important tool for regularizing the distribution, batch normalization (BN) has been widely used in existing methods. However, they neglect that BN is severely biased to the training domain and inevitably suffers a performance drop if directly generalized without being updated. To tackle this issue, we propose Batch Norm Test-time Adaption (BNTA), a novel re-id framework that applies a self-supervised strategy to update BN parameters adaptively. Specifically, BNTA quickly explores the domain-aware information within unlabeled target data before inference, and accordingly modulates the feature distribution normalized by BN to adapt to the target domain. This is accomplished by two designed self-supervised auxiliary tasks, namely part positioning and part nearest neighbor matching, which help the model mine the domain-aware information with respect to the structure and identity of body parts, respectively. To demonstrate the effectiveness of our method, we conduct extensive experiments on three re-id datasets and confirm its superior performance over state-of-the-art methods.

preprint 2022 arXiv

Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes

Sound source localization in visual scenes aims to localize objects emitting the sound in a given image. Recent works showing impressive localization performance typically rely on the contrastive learning framework. However, the random sampling of negatives, as commonly adopted in these methods, can result in misalignment between audio and visual features and thus induce ambiguity in localization. In this paper, instead of following previous literature, we propose Self-Supervised Predictive Learning (SSPL), a negative-free method for sound localization via explicit positive mining. Specifically, we first devise a three-stream network to associate the sound source with two augmented views of one corresponding video frame, leading to semantically coherent similarities between audio and visual features. Second, we introduce a novel predictive coding module for audio-visual feature alignment. Such a module helps SSPL focus on target objects in a progressive manner and effectively lowers the positive-pair learning difficulty. Experiments show that, surprisingly, SSPL outperforms the state-of-the-art approach on two standard sound localization benchmarks. In particular, SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best. Code is available at: https://github.com/zjsong/SSPL.

preprint 2020 arXiv

Adversarial Self-Supervised Learning for Semi-Supervised 3D Action Recognition

We consider the problem of semi-supervised 3D action recognition, which has rarely been explored before. Its major challenge lies in how to effectively learn motion representations from unlabeled data. Self-supervised learning (SSL) has proved very effective at learning representations from unlabeled data in the image domain. However, few effective self-supervised approaches exist for 3D action recognition, and directly applying SSL for semi-supervised learning suffers from misalignment of representations learned from SSL and supervised learning tasks. To address these issues, we present Adversarial Self-Supervised Learning (ASSL), a novel framework that tightly couples SSL and the semi-supervised scheme via neighbor relation exploration and adversarial learning. Specifically, we design an effective SSL scheme to improve the discrimination capability of learned representations for 3D action recognition, through exploring the data relations within a neighborhood. We further propose an adversarial regularization to align the feature distributions of labeled and unlabeled samples. To demonstrate the effectiveness of the proposed ASSL in semi-supervised 3D action recognition, we conduct extensive experiments on NTU and N-UCLA datasets. The results confirm its advantageous performance over state-of-the-art semi-supervised methods in the few-label regime for 3D action recognition.

preprint 2020 arXiv

CIAN: Cross-Image Affinity Net for Weakly Supervised Semantic Segmentation

Weakly supervised semantic segmentation with only image-level labels saves much of the human effort needed to annotate pixel-level labels. Cutting-edge approaches rely on various innovative constraints and heuristic rules to generate the masks for every single image. Although great progress has been achieved by these methods, they treat each image independently and do not take account of the relationships across different images. In this paper, however, we argue that the cross-image relationship is vital for weakly supervised segmentation, because it connects related regions across images, where supplementary representations can be propagated to obtain more consistent and integral regions. To leverage this information, we propose an end-to-end cross-image affinity module, which exploits pixel-level cross-image relationships with only image-level labels. By means of this, our approach achieves 64.3% and 65.3% mIoU on the Pascal VOC 2012 validation and test sets respectively, which is a new state-of-the-art result using only image-level labels for weakly supervised semantic segmentation, demonstrating the superiority of our approach.

preprint 2020 arXiv

Cosmetic-Aware Makeup Cleanser

Face verification aims at determining whether a pair of face images belongs to the same identity. Recent studies have revealed the negative impact of facial makeup on the verification performance. With the rapid development of deep generative models, this paper proposes a semantic-aware makeup cleanser (SAMC) to remove facial makeup under different poses and expressions and achieve verification via generation. The intuition lies in the fact that makeup is a combined effect of multiple cosmetics and tailored treatments should be imposed on different cosmetic regions. To this end, we present both unsupervised and supervised semantic-aware learning strategies in SAMC. At image level, an unsupervised attention module is jointly learned with the generator to locate cosmetic regions and estimate the degree. At feature level, we resort to face parsing only in the training phase and design a localized texture loss as a complement to pursue superior synthetic quality. The experimental results on four makeup-related datasets verify that SAMC not only produces appealing de-makeup outputs at a resolution of 256×256, but also facilitates makeup-invariant face verification through image generation.

preprint 2020 arXiv

Dynamic Graph Representation for Partially Occluded Biometrics

The generalization ability of convolutional neural networks (CNNs) for biometrics drops greatly due to the adverse effects of various occlusions. To this end, we propose a novel unified framework that integrates the merits of both CNNs and graphical models to learn dynamic graph representations for occlusion problems in biometrics, called Dynamic Graph Representation (DGR). Convolutional features on certain regions are re-crafted by a graph generator to establish the connections among the spatial parts of biometrics and build Feature Graphs based on these node representations. Each node of a Feature Graph corresponds to a specific part of the input image and the edges express the spatial relationships between parts. By analyzing the similarities between the nodes, the framework is able to adaptively remove the nodes representing the occluded parts. During dynamic graph matching, we propose a novel strategy to measure the distances of both nodes and adjacency matrices. In this way, the proposed method is more convincing than CNN-based methods because the dynamic graph method implies a more illustrative and reasonable inference of the biometrics decision. Experiments conducted on iris and face demonstrate the superiority of the proposed framework, which boosts the accuracy of occluded biometrics recognition by a large margin compared with baseline methods. The code is available at https://github.com/RenMin1991/Dyamic_Graph_Representation

preprint 2020 arXiv

Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels

Training with more data has always been the most stable and effective way of improving performance in the deep learning era. As the largest object detection dataset so far, Open Images brings great opportunities and challenges for object detection in general and sophisticated scenarios. However, owing to its semi-automatic collecting and labeling pipeline to deal with the huge data scale, the Open Images dataset suffers from label-related problems: objects may explicitly or implicitly have multiple labels, and the label distribution is extremely imbalanced. In this work, we quantitatively analyze these label problems and provide a simple but effective solution. We design a concurrent softmax to handle the multi-label problems in object detection and propose a soft-sampling method with a hybrid training scheduler to deal with the label imbalance. Overall, our method yields a dramatic improvement of 3.34 points, leading to the best single model with 60.90 mAP on the public object detection test set of Open Images. Our ensembling result achieves 67.17 mAP, which is 4.29 points higher than the best result of Open Images public test 2018.

preprint 2020 arXiv

TAGNN: Target Attentive Graph Neural Networks for Session-based Recommendation

Session-based recommendation, which aims to predict users' actions based on anonymous sessions, nowadays plays a vital role in many websites. Many studies have emerged that model a session as a sequence or a graph by investigating temporal transitions of items in a session. However, these methods compress a session into one fixed representation vector without considering the target items to be predicted. The fixed vector will restrict the representation ability of the recommender model, considering the diversity of target items and users' interests. In this paper, we propose a novel target attentive graph neural network (TAGNN) model for session-based recommendation. In TAGNN, target-aware attention adaptively activates different user interests with respect to varied target items. The learned interest representation vector varies with different target items, greatly improving the expressiveness of the model. Moreover, TAGNN harnesses the power of graph neural networks to capture rich item transitions in sessions. Comprehensive experiments conducted on real-world datasets demonstrate its superiority over state-of-the-art methods.

preprint 2020 arXiv

TFNet: Multi-Semantic Feature Interaction for CTR Prediction

CTR (Click-Through Rate) prediction plays a central role in the domain of computational advertising and recommender systems. Several kinds of methods have been proposed in this field, such as Logistic Regression (LR), Factorization Machines (FM) and deep learning based methods like Wide&Deep, Neural Factorization Machines (NFM) and DeepFM. However, such approaches generally use the vector-product of each pair of features, which ignores the different semantic spaces of the feature interactions. In this paper, we propose a novel Tensor-based Feature interaction Network (TFNet) model, which introduces an operating tensor to elaborate feature interactions via multi-slice matrices in multiple semantic spaces. Extensive offline and online experiments show that TFNet: 1) outperforms the competitive compared methods on the typical Criteo and Avazu datasets; 2) achieves large improvements in revenue and click rate in online A/B tests in the largest Chinese App recommender system, Tencent MyApp.

preprint 2016 arXiv

DeMeshNet: Blind Face Inpainting for Deep MeshFace Verification

MeshFace photos have been widely used in many Chinese business organizations to protect ID face photos from being misused. The occlusions incurred by random meshes severely degrade the performance of face verification systems, which raises the MeshFace verification problem between MeshFace and daily photos. Previous methods cast this problem as a typical low-level vision problem, i.e. blind inpainting. They recover perceptually pleasing clear ID photos from MeshFaces by enforcing pixel level similarity between the recovered ID images and the ground-truth clear ID images and then perform face verification on them. Essentially, face verification is conducted on a compact feature space rather than the image pixel space. Therefore, this paper argues that pixel level similarity and feature level similarity jointly offer the key to improving the verification performance. Based on this insight, we offer a novel feature oriented blind face inpainting framework. Specifically, we implement this by establishing a novel DeMeshNet, which consists of three parts. The first part addresses blind inpainting of the MeshFaces by implicitly exploiting extra supervision from the occlusion position to enforce pixel level similarity. The second part explicitly enforces a feature level similarity in the compact feature space, which can explore informative supervision from the feature space to produce better inpainting results for verification. The last part copes with face alignment within the net via a customized spatial transformer module when extracting deep facial features. All three parts are implemented within an end-to-end network that facilitates efficient optimization. Extensive experiments on two MeshFace datasets demonstrate the effectiveness of the proposed DeMeshNet as well as the insight of this paper.

preprint 2016 arXiv

ICE: Information Credibility Evaluation on Social Media via Representation Learning

With the rapid growth of social media, rumors are also spreading widely on social media and bringing harm to people's daily lives. Nowadays, information credibility evaluation has drawn attention from academic and industrial communities. Current methods mainly focus on feature engineering and achieve some success. However, feature engineering based methods require a lot of labor and cannot fully reveal the underlying relations among data. In our view, the key elements of user behaviors for evaluating credibility can be summarized as "who", "what", "when", and "how". Existing methods cannot model the correlation among these key elements during the spreading of microblogs. In this paper, we propose a novel representation learning method, Information Credibility Evaluation (ICE), to learn representations of information credibility on social media. In ICE, latent representations are learnt for modeling user credibility, behavior types, temporal properties, and comment attitudes. The aggregation of these factors in the microblog spreading process yields the representation of a user's behavior, and the aggregation of these dynamic representations generates the credibility representation of an event spreading on social media. Moreover, a pairwise learning method is applied to maximize the credibility difference between rumors and non-rumors. To evaluate the performance of ICE, we conduct experiments on a Sina Weibo data set, and the experimental results show that our ICE model outperforms the state-of-the-art methods.

preprint 2016 arXiv

Multimodal Memory Modelling for Video Captioning

Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is still a challenging problem. In this paper, we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide global visual attention on described targets. Specifically, the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with video and sentence with multiple read and write operations. First, text representation in the Long Short-Term Memory (LSTM) based text decoder is written into the memory, and the memory contents will be read out to guide an attention to select related visual targets. Then, the selected visual information is written into the memory, which will be further read out to the text decoder. To evaluate the proposed model, we perform experiments on two public benchmark datasets: MSVD and MSR-VTT. The experimental results demonstrate that our method outperforms the state-of-the-art methods in terms of BLEU and METEOR.

preprint 2016 arXiv

What Is the Best Practice for CNNs Applied to Visual Instance Retrieval?

Previous work has shown that feature maps of deep convolutional neural networks (CNNs) can be interpreted as feature representations of a particular image region. Features aggregated from these feature maps have been exploited for image retrieval tasks and have achieved state-of-the-art performance in recent years. The key to the success of such methods is the feature representation. However, the different factors that impact the effectiveness of features are still not explored thoroughly, and there is much less discussion about the best combination of them. The main contribution of our paper is the thorough evaluation of the various factors that affect the discriminative ability of the features extracted from CNNs. Based on the evaluation results, we also identify the best choices for different factors and propose a new multi-scale image feature representation method to encode the image effectively. Finally, we show that the proposed method generalises well and outperforms the state-of-the-art methods on four typical datasets used for visual instance retrieval.

preprint 2015 arXiv

Deep Semantic Ranking Based Hashing for Multi-Label Image Retrieval

With the rapid growth of web images, hashing has received increasing interest in large scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels has not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. In our approach, a deep convolutional neural network is incorporated into hash functions to jointly learn feature representations and mappings from them to hash codes, which avoids the limited semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on surrogate loss is used to solve the intractable optimization problem of nonsmooth and multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in terms of ranking evaluation metrics when tested on multi-label image datasets.

preprint 2015 arXiv

Learning Structured Ordinal Measures for Video based Face Recognition

This paper presents a structured ordinal measure method for video-based face recognition that simultaneously learns ordinal filters and structured ordinal features. The problem is posed as a non-convex integer program problem that includes two parts. The first part learns stable ordinal filters to project video data into a large-margin ordinal space. The second seeks self-correcting and discrete codes by balancing the projected data and a rank-one ordinal matrix in a structured low-rank way. Unsupervised and supervised structures are considered for the ordinal matrix. In addition, as a complement to hierarchical structures, deep feature representations are integrated into our method to enhance coding stability. An alternating minimization method is employed to handle the discrete and low-rank constraints, yielding high-quality codes that capture prior structures well. Experimental results on three commonly used face video databases show that our method with a simple voting classifier can achieve state-of-the-art recognition rates using fewer features and samples.

preprint 2010 arXiv

Translation-Invariant Representation for Cumulative Foot Pressure Images

Humans can be distinguished by different limb movements and unique ground reaction force. A cumulative foot pressure image is a 2-D cumulative ground reaction force during one gait cycle. Although it contains pressure spatial distribution information and pressure temporal distribution information, it suffers from several problems, including different shoes and noise, when put into practice as a new biometric for pedestrian identification. In this paper, we propose a hierarchical translation-invariant representation for cumulative foot pressure images, inspired by the success of Convolutional deep belief network for digital classification. The key contribution of our approach is a discriminative hierarchical sparse coding scheme which helps to learn useful discriminative high-level visual features. Based on the feature representation of cumulative foot pressure images, we develop a pedestrian recognition system which is invariant to three different shoes and slight local shape change. Experiments are conducted on a proposed open dataset that contains more than 2800 cumulative foot pressure images from 118 subjects. Evaluations suggest the effectiveness of the proposed method and the potential of cumulative foot pressure images as a biometric.
