Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
37works
0followers
19topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

37 published item(s)

preprint2026arXiv

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.

preprint2024arXiv

WavMark: Watermarking for Audio Generation

Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48\% across ten common attacks, a remarkable reduction of over 2800\% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.

preprint2023arXiv

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

preprint2022arXiv

A collaborative decomposition-based evolutionary algorithm integrating normal and penalty-based boundary intersection for many-objective optimization

Decomposition-based evolutionary algorithms have become fairly popular for many-objective optimization in recent years. However, the existing decomposition methods still are quite sensitive to the various shapes of frontiers of many-objective optimization problems (MaOPs). On the one hand, the cone decomposition methods such as the penalty-based boundary intersection (PBI) are incapable of acquiring uniform frontiers for MaOPs with very convex frontiers. On the other hand, the parallel reference lines of the parallel decomposition methods including the normal boundary intersection (NBI) might result in poor diversity because of under-sampling near the boundaries for MaOPs with concave frontiers. In this paper, a collaborative decomposition method is first proposed to integrate the advantages of parallel decomposition and cone decomposition to overcome their respective disadvantages. This method inherits the NBI-style Tchebycheff function as a convergence measure to heighten the convergence and uniformity of distribution of the PBI method. Moreover, this method also adaptively tunes the extent of rotating an NBI reference line towards a PBI reference line for every subproblem to enhance the diversity of distribution of the NBI method. Furthermore, a collaborative decomposition-based evolutionary algorithm (CoDEA) is presented for many-objective optimization. A collaborative decomposition-based environmental selection mechanism is primarily designed in CoDEA to rank all the individuals associated with the same PBI reference line in the boundary layer and pick out the best ranks. CoDEA is compared with several popular algorithms on 85 benchmark test instances. The experimental results show that CoDEA achieves high competitiveness benefiting from the collaborative decomposition maintaining a good balance among the convergence, uniformity, and diversity of distribution.

preprint2022arXiv

Generative Graph Neural Networks for Link Prediction

Inferring missing links or detecting spurious ones based on observed graphs, known as link prediction, is a long-standing challenge in graph data analysis. With the recent advances in deep learning, graph neural networks have been used for link prediction and have achieved state-of-the-art performance. Nevertheless, existing methods developed for this purpose are typically discriminative, computing features of local subgraphs around two neighboring nodes and predicting potential links between them from the perspective of subgraph classification. In this formalism, the selection of enclosing subgraphs and heuristic structural features for subgraph classification significantly affects the performance of the methods. To overcome this limitation, this paper proposes a novel and radically different link prediction algorithm based on the network reconstruction theory, called GraphLP. Instead of sampling positive and negative links and heuristically computing the features of their enclosing subgraphs, GraphLP utilizes the feature learning ability of deep-learning models to automatically extract the structural patterns of graphs for link prediction under the assumption that real-world graphs are not locally isolated. Moreover, GraphLP explores high-order connectivity patterns to utilize the hierarchical organizational structures of graphs for link prediction. Our experimental results on all common benchmark datasets from different applications demonstrate that the proposed method consistently outperforms other state-of-the-art methods. Unlike the discriminative neural network models used for link prediction, GraphLP is generative, which provides a new paradigm for neural-network-based link prediction.

preprint2022arXiv

How to Share: Balancing Layer and Chain Sharing in Industrial Microservice Deployment

With the rapid development of smart manufacturing, edge computing-oriented microservice platforms are emerging as an important part of production control. In the containerized deployment of microservices, layer sharing can reduce the huge bandwidth consumption caused by image pulling, and chain sharing can reduce communication overhead caused by communication between microservices. The two sharing methods use the characteristics of each microservice to share resources during deployment. However, due to the limited resources of edge servers, it is difficult to meet the optimization goals of the two methods at the same time. Therefore, it is of critical importance to realize the improvement of service response efficiency by balancing the two sharing methods. This paper studies the optimal microservice deployment strategy that can balance layer sharing and chain sharing of microservices. We build a problem that minimizes microservice image pull delay and communication overhead and transform the problem into a linearly constrained integer quadratic programming problem through model reconstruction. A deployment strategy is obtained through the successive convex approximation (SCA) method. Experimental results show that the proposed deployment strategy can balance the two resource sharing methods. When the two sharing methods are equally considered, the average image pull delay can be reduced to 65% of the baseline, and the average communication overhead can be reduced to 30% of the baseline.

preprint2022arXiv

Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and thus attract a lot of interest to be applied for various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as a downstream model. The representations from all hidden layers of the pre-trained model are firstly averaged with learnable weights and then fed into the ECAPA-TDNN as input features. The experimental results on Voxceleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, separately. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.479%, 0.536% and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.

preprint2022arXiv

Multi-Query Video Retrieval

Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years. Despite recent progress, imperfect annotations in existing video retrieval datasets have posed significant challenges on model evaluation and development. In this paper, we tackle this issue by focusing on the less-studied setting of multi-query video retrieval, where multiple descriptions are provided to the model for searching over the video archive. We first show that multi-query retrieval task effectively mitigates the dataset noise introduced by imperfect annotations and better correlates with human judgement on evaluating retrieval abilities of current models. We then investigate several methods which leverage multiple queries at training time, and demonstrate that the multi-query inspired training can lead to superior performance and better generalization. We hope further investigation in this direction can bring new insights on building systems that perform better in real-world video retrieval applications.

preprint2022arXiv

Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization

Online influence maximization aims to maximize the influence spread of a content in a social network with unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of content-dependent online influence maximization problem where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish $\widetilde O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic network demonstrate the efficiency of our algorithm.

preprint2022arXiv

Quantized GAN for Complex Music Generation from Dance Videos

We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motions as input, and learns to generate music samples that plausibly accompany the corresponding input. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that usually rely on pre-defined musical synthesizers, in this work we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector Quantized (VQ) audio representation, and leverage both its generality and high abstraction capacity of its symbolic and continuous counterparts. By performing an extensive set of experiments on multiple datasets, and following a comprehensive evaluation protocol, we assess the generative qualities of our proposal against alternatives. The attained quantitative results, which measure the music consistency, beats correspondence, and music diversity, demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world applications -- and which we hope to serve as a starting point for relevant future research.

preprint2022arXiv

SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. Particularly, SiRi conveys a significant principle to the research of visual grounding, i.e., a better initialized vision-language encoder would help the model converge to a better local minimum, advancing the performance accordingly. In specific, we continually update the parameters of the encoder as the training goes on, while periodically re-initialize rest of the parameters to compel the model to be better optimized based on an enhanced encoder. SiRi can significantly outperform previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% Top1 accuracy on RefCOCO+ testA, outperforming the state-of-the-art approaches (training from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly superior even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify the validity.

preprint2022arXiv

Speech Pre-training with Acoustic Piece

Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre-train a Transformer encoder to learn deep representations from audio data, with objectives predicting either elements from latent vector quantized space or pre-generated labels (known as target codes) with offline clustering. However, those training signals (quantized elements or codes) are independent across different tokens without considering their relations. According to our observation and analysis, the target codes share obvious patterns aligned with phonemized text data. Based on that, we propose to leverage those patterns to better pre-train the model considering the relations among the codes. The patterns we extracted, called "acoustic piece"s, are from the sentence piece result of HuBERT codes. With the acoustic piece as the training signal, we can implicitly bridge the input audio and natural language, which benefits audio-to-text tasks, such as automatic speech recognition (ASR). Simple but effective, our method "HuBERT-AP" significantly outperforms strong baselines on the LibriSpeech ASR task.

preprint2022arXiv

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.

preprint2022arXiv

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of ``virtual'' output channels is introduced to keep track of the overlapping utterances. Compared to the prior streaming multi-talker ASR models, the t-SOT model has the advantages of less inference cost and a simpler model architecture. Moreover, in our experiments with LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves the state-of-the-art word error rates by a significant margin to the prior results. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door for deploying one model for both single- and multi-talker scenarios.

preprint2022arXiv

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize ``who spoke what'' with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR and SID/SD by using LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.

preprint2022arXiv

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance and also the pre-training efficiency, either through decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or performing clustering on the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate significant superiority of our methods to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our pre-trained models also show good transferability in a non-ASR speech task.

preprint2022arXiv

Ultra Fast Speech Separation Model with Teacher Student Learning

Transformer has been successfully applied to speech separation recently with its strong long-dependency modeling capacity using a self-attention mechanism. However, Transformer tends to have heavy run-time costs due to the deep encoder layers, which hinders its deployment on edge devices. A small Transformer model with fewer encoder layers is preferred for computational efficiency, but it is prone to performance degradation. In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning). We introduce layer-wise T-S learning and objective shifting mechanisms to guide the small student model to learn intermediate representations from the large teacher model. Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on LibriCSS dataset. Utilizing more unlabeled speech data, our ultra fast speech separation models achieve more than 10% relative WER reduction.

preprint2022arXiv

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness which is crucial for real-world applications. In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. By doing this, it enforces the network to have consistent predictions for the original and noisy speech, thus allows to learn contextualized representation with noise robustness. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2.9--4.9% relative word error rate (WER) reduction on the synthesized noisy LibriSpeech data without deterioration on the original data, and 5.7% on CHiME-4 real 1-channel noisy data compared to a data augmentation baseline even with a strong language model for decoding. Our results on CHiME-4 can match or even surpass those with well-designed speech enhancement components.

preprint2022arXiv

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factor leads to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to SV task is from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

preprint2021arXiv

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue to prevent their applications. In this work, we explored the potential of Transformer Transducer (T-T) models for the fist pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.

preprint2021arXiv

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner, the learned correlations can be then readily applied in multiple downstream tasks such as the audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE) with an additional Wasserstein distance constraint to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE can effectively learn the audio-visual correlations and can be readily applied in multiple audio-visual downstream tasks to achieve competitive performance even without any given label information during training.

preprint2021arXiv

Learning to Anticipate Egocentric Actions by Imagination

Anticipating actions before they are executed is crucial for a wide range of practical applications, including autonomous driving and robotics. In this paper, we study the egocentric action anticipation task, which predicts future action seconds before it is performed for egocentric videos. Previous approaches focus on summarizing the observed content and directly predicting future action based on past observations. We believe it would benefit the action anticipation if we could mine some cues to compensate for the missing information of the unobserved frames. We then propose to decompose the action anticipation into a series of future feature predictions. We imagine how the visual feature changes in the near future and then predicts future action labels based on these imagined representations. Differently, our ImagineRNN is optimized in a contrastive learning way instead of feature regression. We utilize a proxy task to train the ImagineRNN, i.e., selecting the correct future states from distractors. We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content. This promotes the network to focus on our target, i.e., the future action, as the difference between adjacent frame features is more important for forecasting the future. Extensive experiments on two large-scale egocentric action datasets validate the effectiveness of our method. Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.

preprint2021arXiv

Virtual Reality Sickness Mitigation Methods: A Comparative Study in a Racing Game

Using virtual reality (VR) head-mounted displays (HMDs) can induce VR sickness. VR sickness can cause strong discomfort, decrease users' presence and enjoyment, especially in games, shorten the duration of the VR experience, and can even pose health risks. Previous research has explored different VR sickness mitigation methods by adding visual effects or elements. Field of View (FOV) reduction, Depth of Field (DOF) blurring, and adding a rest frame into the virtual environment are examples of such methods. Although useful in some cases, they might result in information loss. This research is the first to compare VR sickness, presence, workload to complete a search task, and information loss of these three VR sickness mitigation methods in a racing game with two levels of control. To do this, we conducted a mixed factorial user study (N = 32) with degree of control as the between-subjects factor and the VR sickness mitigation techniques as the within-subjects factor. Participants were required to find targets with three difficulty levels while steering or not steering a car in a virtual environment. Our results show that there are no significant differences in VR sickness, presence and workload among these techniques under two levels of control in our VR racing game. We also found that changing FOV dynamically or using DOF blur effects would result in information loss while adding a target reticule as a rest frame would not.

preprint2020arXiv

Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition

We present a work-flow which aims at capturing residents' abnormal activities through the passenger flow of elevator in multi-storey residence buildings. Camera and sensors (hall sensor, photoelectric sensor, gyro, accelerometer, barometer, and thermometer) with internet connection are mounted in elevator to collect image and data. Computer vision algorithms such as instance segmentation, multi-label recognition, embedding and clustering are applied to generalize passenger flow of elevator, i.e. how many people and what kinds of people get in and out of the elevator on each floor. More specifically in our implementation we propose GraftNet, a solution for fine-grained multi-label recognition task, to recognize human attributes, e.g. gender, age, appearance, and occupation. Then anomaly detection of unsupervised learning is hierarchically applied on the passenger flow data to capture abnormal or even illegal activities of the residents which probably bring safety hazard, e.g. drug dealing, pyramid sale gathering, prostitution, and over crowded residence. Experiment shows effects are there, and the captured records will be directly reported to our customer(property managers) for further confirmation.

preprint2020arXiv

Curriculum Pre-training for End-to-End Speech Translation

End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder only through simple speech recognition is not enough and high-level linguistic knowledge should be considered. Inspired by this, we propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages. The difficulty of these courses is gradually increasing. Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.

preprint2020arXiv

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

With the arising concerns for the AI systems provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has already seen the entire video, assists Q-BOT to accomplish the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video by the proposed model and the cooperative learning method, achieving the promising performance where Q-BOT is given the full ground truth history dialog.

preprint2020arXiv

Effects of intervalley scatterings in thermoelectric performance of band-convergent antimonene

The strategy of band convergence of multi-valley conduction bands or multi-peak valence bands has been widely used to search or improve thermoelectric materials. However, the phonon-assisted intervalley scatterings due to multiple band degeneracy are usually neglected in the thermoelectric community. In this work, we investigate the (thermo)electric properties of non-polar monolayer $β$- and $α$-antimonene considering full mode- and momentum-resolved electron-phonon interactions. We also analyze thoroughly the selection rules on electron-phonon matrix-elements using group-theory arguments. Our calculations reveal strong intervalley scattering between the nearly degenerate valley states in both $β$- and $α$-antimonene, and the commonly-used deformation potential approximation neglecting the dominant intervalley scattering gives inaccurate estimations of the electron-phonon scattering and thermoelectric transport properties. By considering full electron-phonon interactions based on the rigid-band approximation, we find that, the maximum value of the thermoelectric figure of merits $zT$ at room temperature reduces to 0.37 in $β$-antimonene, by a factor of 5.7 comparing to the value predicted based on the constant relaxation-time approximation method. Our work not only provides an accurate prediction of the thermoelectric performances of antimonenes that reveals the key role of intervalley scatterings in determining the electronic part of zT, but also showcases a computational framework for thermoelectric materials.

preprint2020arXiv

Gated Channel Transformation for Visual Recognition

In this work, we propose a generally applicable transformation unit for visual recognition with deep convolutional neural networks. This transformation explicitly models channel relationships with explainable control variables. These variables determine the neuron behaviors of competition or cooperation, and they are jointly optimized with the convolutional weight towards more accurate recognition. In Squeeze-and-Excitation (SE) Networks, the channel relationships are implicitly learned by fully connected layers, and the SE block is integrated at the block-level. We instead introduce a channel normalization layer to reduce the number of parameters and computational complexity. This lightweight layer incorporates a simple l2 normalization, enabling our transformation unit applicable to operator-level without much increase of additional parameters. Extensive experiments demonstrate the effectiveness of our unit with clear margins on many vision tasks, i.e., image classification on ImageNet, object detection and instance segmentation on COCO, video classification on Kinetics.

preprint2020arXiv

Low Latency End-to-End Streaming Speech Recognition with a Scout Network

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. However, in the streaming mode, the Transformer model usually incurs significant latency to maintain its recognition accuracy when applying a fixed-length look-ahead window in each encoder layer. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects the whole word boundary without seeing any future frames, while the recognition network predicts the next subword by utilizing the information from all the frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency on the test-clean and test-other data sets of Librispeech.

preprint2020arXiv

MuTual: A Dataset for Multi-Turn Dialogue Reasoning

Non-task oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques. Given a context, current systems are able to yield a relevant and fluent response, but sometimes make logical mistakes because of weak reasoning capabilities. To facilitate the conversation reasoning research, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. Compared to previous benchmarks for non-task oriented dialogue systems, MuTual is much more challenging since it requires a model that can handle various reasoning problems. Empirical results show that state-of-the-art methods only reach 71%, which is far behind the human performance of 94%, indicating that there is ample room for improving reasoning ability. MuTual is available at https://github.com/Nealcly/MuTual.

preprint2020arXiv

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming mode. We show that both streaming RNN-T and transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.

preprint2020arXiv

Revisiting EmbodiedQA: A Simple Baseline and Beyond

In Embodied Question Answering (EmbodiedQA), an agent interacts with an environment to gather necessary information for answering user questions. Existing works have laid a solid foundation towards solving this interesting problem. But the current performance, especially in navigation, suggests that EmbodiedQA might be too challenging for the contemporary approaches. In this paper, we empirically study this problem and introduce 1) a simple yet effective baseline that achieves promising performance; 2) an easier and practical setting for EmbodiedQA where an agent has a chance to adapt the trained model to a new environment before it actually answers users questions. In this new setting, we randomly place a few objects in new environments, and upgrade the agent policy by a distillation network to retain the generalization ability from the trained model. On the EmbodiedQA v1 benchmark, under the standard setting, our simple baseline achieves very competitive results to the-state-of-the-art; in the new setting, we found the introduced small change in settings yields a notable gain in navigation.

preprint2020arXiv

Semantic Mask for Transformer based End-to-End Speech Recognition

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch, without the assumption of prior knowledge such as the alignments. However, this model is prone to overfitting, especially when the amount of training data is limited. Inspired by SpecAugment and BERT, in this paper, we propose a semantic mask based regularization for training such kind of end-to-end (E2E) model. The idea is to mask the input features corresponding to a particular output token, e.g., a word or a word-piece, in order to encourage the model to fill the token based on the contextual information. While this approach is applicable to the encoder-decoder framework with any type of neural network architecture, we study the transformer-based model for ASR in this work. We perform experiments on Librispeech 960h and TedLium2 data sets, and achieve the state-of-the-art performance on the test set in the scope of E2E models.

preprint2020arXiv

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, ie, one branch for verb classification and the other branch for noun classification. However, correlation studies between the verb and the noun branches have been largely ignored. Besides, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Finer position-aware object detection features can facilitate the understanding of actor's interaction with the object. We introduce these features in action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication process not only injects local details into global features but also exploits implicit guidance about the spatio-temporal position of an on-going action. We introduce novel symbiotic attention (SA) to enable effective communication. It first normalizes the detection guided features on one branch to underline the action-relevant information from the other branch. SA adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered for the selection of most action-relevant information. It identifies the most valuable and discriminative feature for classification. We validate the effectiveness of our SAP quantitatively and qualitatively. Notably, it achieves the state-of-the-art on two large-scale egocentric video datasets.

preprint2020arXiv

Unsupervised Person Re-identification via Softened Similarity Learning

Person re-identification (re-ID) is an important topic in computer vision. This paper studies the unsupervised setting of re-ID, which does not require any labeled information and thus is freely deployed to new scenarios. There are very few studies under this setting, and one of the best approach till now used iterative clustering and classification, so that unlabeled images are clustered into pseudo classes for a classifier to get trained, and the updated features are used for clustering and so on. This approach suffers two problems, namely, the difficulty of determining the number of clusters, and the hard quantization loss in clustering. In this paper, we follow the iterative training mechanism but discard clustering, since it incurs loss from hard quantization, yet its only product, image-level similarity, can be easily replaced by pairwise computation and a softened classification task. With these improvements, our approach becomes more elegant and is more robust to hyper-parameter changes. Experiments on two image-based and video-based datasets demonstrate state-of-the-art performance under the unsupervised re-ID setting.

preprint2019arXiv

Cascaded Revision Network for Novel Object Captioning

Image captioning, a challenging task where the machine automatically describes an image by sentences, has drawn significant attention in recent years. Despite the remarkable improvements of recent approaches, however, these methods are built upon a large set of training image-sentence pairs. The expensive labor efforts hence limit the captioning model to describe the wider world. In this paper, we present a novel network structure, Cascaded Revision Network, which aims at relieving the problem by equipping the model with out-of-domain knowledge. CRN first tries its best to describe an image using the existing vocabulary from in-domain knowledge. Due to the lack of out-of-domain knowledge, the caption may be inaccurate or include ambiguous words for the image with unknown (novel) objects. We propose to re-edit the primary captioning sentence by a series of cascaded operations. We introduce a perplexity predictor to find out which words are most likely to be inaccurate given the input image. Thereafter, we utilize external knowledge from a pre-trained object detection model and select more accurate words from detection results by the visual matching module. In the last step, we design a semantic matching module to ensure that the novel object is fit in the right position. By this novel cascaded captioning-revising mechanism, CRN can accurately describe images with unseen objects. We validate the proposed method with state-of-the-art performance on the held-out MSCOCO dataset as well as scale to ImageNet, demonstrating the effectiveness of this method.

preprint2019arXiv

Electrically Tunable Wafer-Sized Three-Dimensional Topological Insulator Thin Films Grown by Magnetron Sputtering

Three-dimensional (3D) topological insulators (TIs) are candidate materials for various electronic and spintronic devices due to their strong spin-orbit coupling and unique surface electronic structure. Rapid, low-cost preparation of large-area TI thin films compatible with conventional semiconductor technology is key to the practical applications of TIs. Here, we show that wafer-sized Bi2Te3 family TI and magnetic TI films with decent quality and well-controlled composition and properties can be prepared on amorphous SiO2/Si substrates by magnetron cosputtering. The SiO2/Si substrates enable us to electrically tune (Bi1-xSbx)2Te3 and Cr-doped (Bi1-xSbx)2Te3 TI films between p-type and n-type behavior and thus study the phenomena associated with topological surface states, such as the quantum anomalous Hall effect (QAHE). This work significantly facilitates the fabrication of TI-based devices for electronic and spintronic applications.