Source author record

Kai Shuang

Kai Shuang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Computer Vision, eess.IV, Artificial Intelligence, Computation and Language, Machine Learning

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

Preprint, 2022, arXiv

Beyond the Granularity: Multi-Perspective Dialogue Collaborative Selection for Dialogue State Tracking

In dialogue state tracking, dialogue history is crucial material, and its utilization varies between models. However, no matter how the dialogue history is used, each existing model uses its own consistent dialogue history during the entire state tracking process, regardless of which slot is updated. Yet updating different slots in different turns requires different dialogue history. Therefore, using consistent dialogue contents may provide insufficient or redundant information for different slots, which affects overall performance. To address this problem, we devise DiCoS-DST to dynamically select the relevant dialogue contents corresponding to each slot for state updating. Specifically, it first retrieves turn-level utterances of dialogue history and evaluates their relevance to the slot from a combination of three perspectives: (1) its explicit connection to the slot name; (2) its relevance to the current turn dialogue; (3) Implicit Mention Oriented Reasoning. These perspectives are then combined to yield a decision, and only the selected dialogue contents are fed into the State Generator, which explicitly minimizes the distracting information passed to the downstream state prediction. Experimental results show that our approach achieves new state-of-the-art performance on MultiWOZ 2.1 and MultiWOZ 2.2, and achieves superior performance on multiple mainstream benchmark datasets (including Sim-M, Sim-R, and DSTC2).
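The selection step described in this abstract can be illustrated with a small sketch. The abstract does not say how the three perspectives are combined, so the weighted sum, the threshold, and every name here (`TurnScores`, `select_turns`, `weights`, `threshold`) are assumptions made for illustration, not the DiCoS-DST implementation. The per-perspective scores are taken as given inputs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TurnScores:
    """Hypothetical relevance scores of one dialogue turn for one slot."""
    turn_index: int
    slot_name_score: float     # perspective 1: explicit connection to the slot name
    current_turn_score: float  # perspective 2: relevance to the current turn
    implicit_score: float      # perspective 3: implicit-mention reasoning


def select_turns(
    scores: Sequence[TurnScores],
    weights: Sequence[float] = (0.4, 0.3, 0.3),  # assumed combination weights
    threshold: float = 0.5,                      # assumed decision threshold
) -> List[int]:
    """Return indices of turns whose combined score reaches the threshold."""
    selected = []
    for s in scores:
        combined = (
            weights[0] * s.slot_name_score
            + weights[1] * s.current_turn_score
            + weights[2] * s.implicit_score
        )
        if combined >= threshold:
            selected.append(s.turn_index)
    return selected


history = [
    TurnScores(0, 0.9, 0.8, 0.7),  # clearly relevant to the slot
    TurnScores(1, 0.1, 0.2, 0.0),  # distracting turn
    TurnScores(2, 0.7, 0.6, 0.6),  # moderately relevant
]
kept = select_turns(history)
```

Only the turns in `kept` would be passed on to a downstream state generator; the distracting turn is dropped.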

Preprint, 2020, arXiv

A Hierarchical User Intention-Habit Extract Network for Credit Loan Overdue Risk Detection

More personal consumer loan products are emerging in mobile banking apps. For ease of use, the application process is kept simple, so users are asked to fill in little information when applying for a loan, which makes it hard to construct their credit profiles. The simple application process thus poses a major challenge for overdue risk detection, as a higher overdue rate results in greater economic losses to the bank. In this paper, we propose a model named HUIHEN (Hierarchical User Intention-Habit Extract Network) that leverages users' behavior information in the mobile banking app. Because users' behaviors are diverse, we divide behavior sequences into sessions according to the time interval, and use a field-aware method to extract the intra-field information of behaviors. Then, we propose a hierarchical network composed of a time-aware GRU and a user-item-aware GRU to capture users' short-term intentions and long-term habits, which can be regarded as a supplement to the user profile. The proposed model can improve accuracy without increasing the complexity of the original online application process. Experimental results demonstrate the superiority of HUIHEN and show that it outperforms other state-of-the-art models on all datasets.
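Dividing a behavior sequence into sessions by time interval, as mentioned in this abstract, can be sketched as follows. The abstract gives no gap value or event schema, so the 30-minute gap, the `(timestamp, action)` event shape, and the name `split_into_sessions` are assumptions for illustration only.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, action name); hypothetical schema


def split_into_sessions(events: List[Event], max_gap_seconds: int = 1800) -> List[List[Event]]:
    """Group events into sessions; a gap above max_gap_seconds starts a new session."""
    sessions: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Compare with the timestamp of the last event in the current session.
        if sessions and event[0] - sessions[-1][-1][0] <= max_gap_seconds:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions


events = [
    (0, "login"), (60, "view_balance"), (120, "view_loan"),
    (4000, "login"), (4100, "apply_loan"),
    (9000, "login"),
]
sessions = split_into_sessions(events)
```

Each resulting session could then be fed to a short-term model, while the sequence of sessions carries the longer-term pattern.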

Preprint, 2020, arXiv

Contrastive Visual-Linguistic Pretraining

Several multi-modality representation learning approaches, such as LXMERT and ViLBERT, have been proposed recently. Such approaches can achieve superior performance due to the high-level semantic information captured during large-scale multimodal pretraining. However, as ViLBERT and LXMERT adopt visual region regression and classification losses, they often suffer from domain gap and noisy label problems, because their visual features are pretrained on the Visual Genome dataset. To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning. We evaluate CVLP on several downstream tasks, including VQA, GQA and NLVR2, to validate the superiority of contrastive learning on multi-modality representation learning. Our code is available at: https://github.com/ArcherYunDong/CVLP-.
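For readers unfamiliar with contrastive learning, a generic InfoNCE-style loss is sketched below. This is a common contrastive objective and not necessarily the loss used in CVLP; the function name `info_nce_loss`, the temperature value, and the use of plain Python lists as feature vectors are assumptions. Vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(_dot(v, v))  # assumes v is not the zero vector
    return [x / norm for x in v]


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, commonly used in the literature
) -> float:
    """Cross-entropy of picking the positive among positive + negatives."""
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive)) / temperature]
    logits += [_dot(q, _normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the query is closer to its positive than to any negative, and large when a negative is the closer match.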

Preprint, 2020, arXiv

Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering

Multi-modality fusion technologies have greatly improved the performance of neural network-based Video Description/Caption, Visual Question Answering (VQA) and Audio Visual Scene-aware Dialog (AVSD) in recent years. Most previous approaches only explore the last layers of multiple-layer feature fusion while overlooking the importance of intermediate layers. To address this, we propose an efficient Quaternion Block Network (QBN) that learns interaction not only for the last layer but also for all intermediate layers simultaneously. In our proposed QBN, we use the holistic text features to guide the update of visual features. Meanwhile, Hamilton quaternion products efficiently carry information from higher layers to lower layers for both visual and text modalities. The evaluation results show that QBN improves performance on VQA 2.0, even surpassing large-scale BERT or visual BERT pre-trained models. An extensive ablation study examines the influence of each proposed module.
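The Hamilton quaternion product named in this abstract is a standard operation and can be written out directly. The sketch below shows only the scalar product of two quaternions given as `(a, b, c, d)` tuples for `a + bi + cj + dk`; how QBN applies it to layer features is not described here, and the function name `hamilton_product` is an assumption.

```python
from typing import Tuple

Quaternion = Tuple[float, float, float, float]  # (real, i, j, k)


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q; note that it is not commutative."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k
```

Because each output component mixes all four input components, a single product lets every part of one operand interact with every part of the other.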
