Researcher profile

Zhi-Hong Deng

Zhi-Hong Deng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2022arXiv

Alleviating Representational Shift for Continual Fine-tuning

We study a practical setting of continual learning: fine-tuning on a pre-trained model continually. Previous work has found that, when training on new tasks, the features (penultimate layer representations) of previous data will change, called representational shift. Besides the shift of features, we reveal that the intermediate layers' representational shift (IRS) also matters since it disrupts batch normalization, which is another crucial cause of catastrophic forgetting. Motivated by this, we propose ConFiT, a fine-tuning method incorporating two components, cross-convolution batch normalization (Xconv BN) and hierarchical fine-tuning. Xconv BN maintains pre-convolution running means instead of post-convolution, and recovers post-convolution ones before testing, which corrects the inaccurate estimates of means under IRS. Hierarchical fine-tuning leverages a multi-stage strategy to fine-tune the pre-trained network, preventing massive changes in Conv layers and thus alleviating IRS. Experimental results on four datasets show that our method remarkably outperforms several state-of-the-art methods with lower storage overhead.

preprint2022arXiv

Bypassing Logits Bias in Online Class-Incremental Learning with a Generative Framework

Continual learning requires the model to maintain the learned knowledge while learning from a non-i.i.d data stream continually. Due to the single-pass training setting, online continual learning is very challenging, but it is closer to the real-world scenarios where quick adaptation to new data is appealing. In this paper, we focus on online class-incremental learning setting in which new classes emerge over time. Almost all existing methods are replay-based with a softmax classifier. However, the inherent logits bias problem in the softmax classifier is a main cause of catastrophic forgetting while existing solutions are not applicable for online settings. To bypass this problem, we abandon the softmax classifier and propose a novel generative framework based on the feature space. In our framework, a generative classifier which utilizes replay memory is used for inference, and the training objective is a pair-based metric learning loss which is proven theoretically to optimize the feature space in a generative way. In order to improve the ability to learn new data, we further propose a hybrid of generative and discriminative loss to train the model. Extensive experiments on several benchmarks, including newly introduced task-free datasets, show that our method beats a series of state-of-the-art replay-based methods with discriminative classifiers, and reduces catastrophic forgetting consistently with a remarkable margin.

preprint2022arXiv

Convolutional Bypasses Are Better Vision Transformer Adapters

The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as the size of Vision Transformer (ViT) grows exponentially, the full finetuning becomes prohibitive in view of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) to pretrained ViT and only finetune these modules while the pretrained weights are frozen. However, these modules were originally proposed to finetune language models and did not take into account the prior knowledge specifically for visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small amount (less than 0.5% of model parameters) of trainable parameters to adapt the large ViT. Different from other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and thus is more suitable for visual tasks, especially in the low-data regime. Experimental results on VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity to tailor vision-oriented adaptation modules for adapting vision models.

preprint2022arXiv

G$^3$SR: Global Graph Guided Session-based Recommendation

Session-based recommendation tries to make use of anonymous session data to deliver high-quality recommendation under the condition that user-profiles and the complete historical behavioral data of a target user are unavailable. Previous works consider each session individually and try to capture user interests within a session. Despite their encouraging results, these models can only perceive intra-session items and cannot draw upon the massive historical relational information. To solve this problem, we propose a novel method named G$^3$SR (Global Graph Guided Session-based Recommendation). G$^3$SR decomposes the session-based recommendation workflow into two steps. First, a global graph is built upon all session data, from which the global item representations are learned in an unsupervised manner. Then, these representations are refined on session graphs under the graph networks, and a readout function is used to generate session representations for each session. Extensive experiments on two real-world benchmark datasets show remarkable and consistent improvements of the G$^3$SR method over the state-of-the-art methods, especially for cold items.

preprint2022arXiv

Rethinking Minimal Sufficient Representation in Contrastive Learning

Contrastive learning between different views of the data achieves outstanding success in the field of self-supervised representation learning and the learned representations are useful in broad downstream tasks. Since all supervision information for one view comes from the other view, contrastive learning approximately obtains the minimal sufficient representation which contains the shared information and eliminates the non-shared information between views. Considering the diversity of the downstream tasks, it cannot be guaranteed that all task-relevant information is shared between views. Therefore, we assume the non-shared task-relevant information cannot be ignored and theoretically prove that the minimal sufficient representation in contrastive learning is not sufficient for the downstream tasks, which causes performance degradation. This reveals a new problem that the contrastive learning models have the risk of over-fitting to the shared information between views. To alleviate this problem, we propose to increase the mutual information between the representation and input as regularization to approximately introduce more task-relevant information, since we cannot utilize any downstream task information during training. Extensive experiments verify the rationality of our analysis and the effectiveness of our method. It significantly improves the performance of several classic contrastive learning models in downstream tasks. Our code is available at https://github.com/Haoqing-Wang/InfoCL.

preprint2020arXiv

Dynamically Pruned Message Passing Networks for Large-Scale Knowledge Graph Reasoning

We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, embedding-based or path-based, we learn an input-dependent subgraph to explicitly model reasoning process. Subgraphs are dynamically constructed and expanded by applying graphical attention mechanism conditioned on input queries. In this way, we not only construct graph-structured explanations but also enable message passing designed in Graph Neural Networks (GNNs) to scale with graph sizes. We take the inspiration from the consciousness prior proposed by and develop a two-GNN framework to simultaneously encode input-agnostic full graph representation and learn input-dependent local one coordinated by an attention module. Experiments demonstrate the reasoning capability of our model that is to provide clear graphical explanations as well as deliver accurate predictions, outperforming most state-of-the-art methods in knowledge base completion tasks.

preprint2020arXiv

Fast Structured Decoding for Sequence Models

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve then decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.

preprint2020arXiv

Few-shot Learning with LSSVM Base Learner and Transductive Modules

The performance of meta-learning approaches for few-shot learning generally depends on three aspects: features suitable for comparison, the classifier ( base learner ) suitable for low-data scenarios, and valuable information from the samples to classify. In this work, we make improvements for the last two aspects: 1) although there are many effective base learners, there is a trade-off between generalization performance and computational overhead, so we introduce multi-class least squares support vector machine as our base learner which obtains better generation than existing ones with less computational overhead; 2) further, in order to utilize the information from the query samples, we propose two simple and effective transductive modules which modify the support set using the query samples, i.e., adjusting the support samples basing on the attention mechanism and adding the prototypes of the query set with pseudo labels to the support set as the pseudo support samples. These two modules significantly improve the few-shot classification accuracy, especially for the difficult 1-shot setting. Our model, denoted as FSLSTM (Few-Shot learning with LSsvm base learner and Transductive Modules), achieves state-of-the-art performance on miniImageNet and CIFAR-FS few-shot learning benchmarks.

preprint2020arXiv

Generative Feature Replay with Orthogonal Weight Modification for Continual Learning

The ability of intelligent agents to learn and remember multiple tasks sequentially is crucial to achieving artificial general intelligence. Many continual learning (CL) methods have been proposed to overcome catastrophic forgetting which results from non i.i.d data in the sequential learning of neural networks. In this paper we focus on class incremental learning, a challenging CL scenario. For this scenario, generative replay is a promising strategy which generates and replays pseudo data for previous tasks to alleviate catastrophic forgetting. However, it is hard to train a generative model continually for relatively complex data. Based on recently proposed orthogonal weight modification (OWM) algorithm which can approximately keep previously learned feature invariant when learning new tasks, we propose to 1) replay penultimate layer feature with a generative model; 2) leverage a self-supervised auxiliary task to further enhance the stability of feature. Empirical results on several datasets show our method always achieves substantial improvement over powerful OWM while conventional generative replay always results in a negative effect. Meanwhile our method beats several strong baselines including one based on real data storage. In addition, we conduct experiments to study why our method is effective.