Source author record

Yi Ren

Yi Ren appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Catalog footprint

What is connected

55 works

25 topics

4 close collaborators


preprint, 2026, arXiv

Generate Your Talking Avatar from Video Reference

Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).

preprint, 2024, arXiv

Economics Arena for Large Language Models

Large language models (LLMs) have been extensively used as the backbones for general-purpose agents, and some economics literature suggests that LLMs are capable of playing various types of economics games. Following these works, to overcome the limitation of evaluating LLMs using static benchmarks, we propose to explore competitive games as an evaluation for LLMs to incorporate multiple players and make the environment dynamic. By varying the game history revealed to LLM-based players, we find that most LLMs are rational in that they play strategies that can increase their payoffs, but not as rational as indicated by Nash Equilibria (NEs). Moreover, when game history is available, certain types of LLMs, such as GPT-4, can converge faster to the NE strategies, which suggests a higher level of rationality in comparison to other models. In the meantime, certain types of LLMs can win more often when game history is available, and we argue that the winning rate reflects the reasoning ability with respect to the strategies of other players. Throughout all our experiments, we observe that the ability to strictly follow the game rules described in natural language also varies among the LLMs we tested. In this work, we provide an economics arena for the LLM research community as a dynamic simulation to test the above-mentioned abilities of LLMs, i.e. rationality, strategic reasoning ability, and instruction-following capability.

preprint, 2024, arXiv

Evolved Massive Stars at Low-metallicity VI. Mass-Loss Rate of Red Supergiant Stars in the Large Magellanic Cloud

Mass loss is a crucial process that affects the observational properties, evolution path and fate of highly evolved stars. However, the mechanism of mass loss is still unclear, and the mass-loss rate (MLR) of red supergiant stars (RSGs) requires further research and precise evaluation. To address this, we utilized an updated and complete sample of RSGs in the Large Magellanic Cloud (LMC) and employed the 2-DUST radiation transfer model and spectral energy distribution (SED) fitting approach to determine the dust-production rates (DPRs) and dust properties of the RSGs. We have fitted 4,714 selected RSGs with over 100,000 theoretical templates of evolved stars. Our results show that the DPR range of RSGs in the LMC is $10^{-11}\, \rm{M_{\odot}\, yr^{-1}}$ to $10^{-7}\, \rm{M_{\odot}\, yr^{-1}}$, and the total DPR of all RSGs is $1.14 \times 10^{-6} \, \rm{M_{\odot} \, yr^{-1}}$. We find that $63.3\%$ of RSGs are oxygen-rich, and they account for $97.2\%$ of the total DPR. The optically thin RSGs, which comprise $30.6\%$ of our sample, contribute only $0.1\%$ of the total DPR, while carbon-rich RSGs ($6.1\%$) produce $2.7\%$ of the total DPR. Overall, 208 RSGs contributed $76.6\%$ of the total DPR. We have established a new relationship between the MLR and luminosity of RSGs in the LMC, which exhibits a positive trend and a clear turning point at $\log{L/L_{\odot}} \approx 4.4$.
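The percentages quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; all names here are illustrative.
TOTAL_RSGS = 4714
TOTAL_DPR = 1.14e-6  # total dust-production rate, solar masses per year

# class name -> (share of the sample in %, share of the total DPR in %)
classes = {
    "oxygen-rich": (63.3, 97.2),
    "optically thin": (30.6, 0.1),
    "carbon-rich": (6.1, 2.7),
}

# Both sets of shares should add up to 100%.
sample_share_sum = sum(sample for sample, _ in classes.values())
dpr_share_sum = sum(dpr for _, dpr in classes.values())

# 208 stars are reported to contribute 76.6% of the total DPR.
top_fraction_of_sample = 208 / TOTAL_RSGS
top_dpr = 0.766 * TOTAL_DPR

print(f"sample shares sum: {sample_share_sum:.1f}%")
print(f"DPR shares sum: {dpr_share_sum:.1f}%")
print(f"top emitters: {top_fraction_of_sample:.1%} of the sample")
print(f"DPR from top emitters: {top_dpr:.3e} Msun/yr")
```

Under these figures, roughly 4.4% of the sample accounts for about $8.7 \times 10^{-7}\, \rm{M_{\odot}\, yr^{-1}}$ of dust production.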

preprint, 2022, arXiv

A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation

It is difficult for non-autoregressive translation (NAT) models to capture the multi-modal distribution of target translations due to their conditional independence assumption, which is known as the "multi-modality problem", including the lexical multi-modality and the syntactic multi-modality. While the first one has been well studied, the syntactic multi-modality poses a severe challenge to the standard cross entropy (XE) loss in NAT and is understudied. In this paper, we conduct a systematic study on the syntactic multi-modality problem. Specifically, we decompose it into short- and long-range syntactic multi-modalities and evaluate several recent NAT algorithms with advanced loss functions on both carefully designed synthesized datasets and real datasets. We find that the Connectionist Temporal Classification (CTC) loss and the Order-Agnostic Cross Entropy (OAXE) loss can better handle short- and long-range syntactic multi-modalities respectively. Furthermore, we take the best of both and design a new loss function to better handle the complicated syntactic multi-modality in real-world datasets. To facilitate practical usage, we provide a guide to using different loss functions for different kinds of syntactic multi-modality.

preprint, 2022, arXiv

Attributable-Watermarking of Speech Generative Models

Generative models are now capable of synthesizing images, speech, and videos that are hardly distinguishable from authentic content. Such capabilities cause concerns such as malicious impersonation and IP theft. This paper investigates a solution for model attribution, i.e., the classification of synthetic content by its source model via watermarks embedded in the content. Building on past success of model attribution in the image domain, we discuss algorithmic improvements for generating user-end speech models that empirically achieve high attribution accuracy, while maintaining high generation quality. We show the trade-off between attributability and generation quality under a variety of attacks on generated speech signals attempting to remove the watermarks, and the feasibility of learning robust watermarks against these attacks.

preprint, 2022, arXiv

Better Supervisory Signals by Observing Learning Paths

Better-supervised models might have better performance. In this paper, we first clarify what makes for good supervision for a classification problem, and then explain two existing label refining methods, label smoothing and knowledge distillation, in terms of our proposed criterion. To further answer why and how better supervision emerges, we observe the learning path, i.e., the trajectory of the model's predictions during training, for each training sample. We find that the model can spontaneously refine "bad" labels through a "zig-zag" learning path, which occurs on both toy and real datasets. Observing the learning path not only provides a new perspective for understanding knowledge distillation, overfitting, and learning dynamics, but also reveals that the supervisory signal of a teacher network can be very unstable near the best points in training on real tasks. Inspired by this, we propose a new knowledge distillation scheme, Filter-KD, which improves downstream classification performance in various settings.

preprint, 2022, arXiv

Configuration-Aware Safe Control for Mobile Robotic Arm with Control Barrier Functions

Collision avoidance is a widely investigated topic in robotic applications. When applying collision avoidance techniques to a mobile robot, how to deal with the spatial structure of the robot still remains a challenge. In this paper, we design a configuration-aware safe control law by solving a Quadratic Program (QP) with designed Control Barrier Function (CBF) constraints, which can safely navigate a mobile robotic arm to a desired region while avoiding collision with environmental obstacles. The advantage of our approach is that it correctly and elegantly incorporates the spatial structure of the mobile robotic arm. This is achieved by merging geometric restrictions among mobile robotic arm links into CBF constraints. Simulations on a rigid rod and the modeled mobile robotic arm are performed to verify the feasibility and time-efficiency of the proposed method. Numerical results on the computation time for different degrees of freedom illustrate that our method scales well with dimension.
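To make the CBF-constrained QP idea concrete, here is a minimal generic sketch, not the paper's configuration-aware formulation. It assumes a 2D point robot with single-integrator dynamics (velocity is the control) and one circular obstacle. With a single affine constraint the QP reduces to a closed-form projection onto a half-space, so no solver is needed. The function name, the barrier choice, and the gain `alpha` are all hypothetical.

```python
def cbf_safe_control(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so a point robot (x' = u) stays outside a disc.

    Barrier (assumed):  h(x) = |x - c|^2 - r^2, with h >= 0 meaning safe.
    CBF condition:      dh/dt = 2 (x - c) . u >= -alpha * h(x).
    QP:                 minimise |u - u_nom|^2 subject to a . u >= b,
                        where a = 2 (x - c) and b = -alpha * h(x).
    """
    dx = [x[0] - center[0], x[1] - center[1]]
    h = dx[0] ** 2 + dx[1] ** 2 - radius ** 2
    a = [2.0 * dx[0], 2.0 * dx[1]]
    b = -alpha * h

    # Positive margin means the nominal control already satisfies the constraint.
    margin = a[0] * u_nom[0] + a[1] * u_nom[1] - b
    if margin >= 0.0:
        return [float(u_nom[0]), float(u_nom[1])]

    norm_sq = a[0] ** 2 + a[1] ** 2
    if norm_sq == 0.0:
        # At the obstacle centre the barrier gradient vanishes.
        raise ValueError("constraint cannot be enforced at the obstacle centre")

    # Project u_nom onto the boundary of the half-space a . u >= b.
    scale = -margin / norm_sq
    return [u_nom[0] + scale * a[0], u_nom[1] + scale * a[1]]


# Robot at (-2, 0) heading towards a unit disc at the origin.
u_safe = cbf_safe_control([-2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0)
# Robot moving away from the disc: the nominal control is left untouched.
u_free = cbf_safe_control([-2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], 1.0)
print(u_safe, u_free)
```

In this toy setting the filter only slows the approach velocity when the nominal command would violate the barrier condition. A multi-link arm would add one such constraint per geometric restriction and require a general QP solver.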

preprint, 2022, arXiv

DA$^2$ Dataset: Toward Dexterity-Aware Dual-Arm Grasping

In this paper, we introduce DA$^2$, the first large-scale dual-arm dexterity-aware dataset for the generation of optimal bimanual grasping pairs for arbitrary large objects. The dataset contains about 9M pairs of parallel-jaw grasps, generated from more than 6000 objects and each labeled with various grasp dexterity measures. In addition, we propose an end-to-end dual-arm grasp evaluation model trained on the rendered scenes from this dataset. We utilize the evaluation model as our baseline to show the value of this novel and nontrivial dataset by both online analysis and real robot experiments. All data and related code will be open-sourced at https://sites.google.com/view/da2dataset.

preprint, 2022, arXiv

Dependence of pulsation mode of Cepheids on metallicity

The Cepheid variables in the SMC, the LMC, the Milky Way, M33 and M31 are used to examine the dependence of pulsation mode on metallicity which was previously found in red supergiants. The initial samples of Cepheids are collected from the Cepheid catalogs identified from the OGLE, PS1, DIRECT, WISE and ZTF surveys. The contaminants are removed with the help of the Gaia/EDR3 astrometric information for external galaxies or by comparing the geometric distance and the distance from the P-L relation for the Milky Way. The division of fundamental and first-overtone mode is refined according to the gap between the two modes in the P-L diagram of the objects in each galaxy. The ratio of FU/(FU+1O) is found to be 0.59, 0.60, 0.69, 0.83 and 0.85 for the SMC, the LMC, the Milky Way, M33 and M31 respectively in order of metallicity, which confirms that the pulsation mode depends on metallicity in the way that the ratio of FU/(FU+1O) increases with metallicity. This dependence is not changed if the incompleteness of the samples is taken into account.
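The trend claimed in this abstract can be checked directly from the five quoted ratios. The sketch below assumes only those numbers and the galaxy ordering given above; the helper names are illustrative.

```python
# FU/(FU+1O) ratios quoted in the abstract, in its stated order of
# increasing metallicity. Helper names are illustrative.
ratios = [
    ("SMC", 0.59),
    ("LMC", 0.60),
    ("Milky Way", 0.69),
    ("M33", 0.83),
    ("M31", 0.85),
]


def is_non_decreasing(values):
    """Return True if each value is at least as large as the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))


def first_overtone_fraction(fu_ratio):
    """1O/(FU+1O) is the complement of FU/(FU+1O)."""
    return 1.0 - fu_ratio


values = [ratio for _, ratio in ratios]
trend_holds = is_non_decreasing(values)

for galaxy, ratio in ratios:
    print(f"{galaxy}: FU share {ratio:.2f}, 1O share {first_overtone_fraction(ratio):.2f}")
print("FU share rises with metallicity:", trend_holds)
```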

preprint, 2022, arXiv

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial network (GAN) to reconstruct the acoustic features, while they suffer from over-smoothing and unstable training issues respectively, which hinder the naturalness of synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts the noise into mel-spectrogram conditioned on the music score. By implicitly optimizing a variational bound, DiffSinger can be stably trained and generate realistic outputs. To further improve the voice quality and speed up inference, we introduce a shallow diffusion mechanism to make better use of the prior knowledge learned by the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, according to the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. Besides, we propose boundary prediction methods to locate the intersection and determine the shallow step adaptively. The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extended experiments also demonstrate the generalization of our methods to the text-to-speech task (DiffSpeech). Audio samples: https://diffsinger.github.io. Code: https://github.com/MoonInTheRiver/DiffSinger. The old title of this work: "Diffsinger: Diffusion acoustic model for singing voice synthesis".
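The shallow-step idea can be illustrated with the standard closed-form DDPM forward process, $x_k = \sqrt{\bar\alpha_k}\,x_0 + \sqrt{1-\bar\alpha_k}\,\epsilon$. The sketch below noises a stand-in "decoder prediction" to a shallow step $k < T$, from which a reverse process would then start. The linear beta schedule, the step counts, and every function name are assumptions for illustration, not values or code from DiffSinger.

```python
import math
import random


def linear_betas(total_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the paper's schedule may differ."""
    if total_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (total_steps - 1)
    return [beta_start + i * step for i in range(total_steps)]


def alpha_bar(betas):
    """Cumulative products of (1 - beta_t), one entry per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse_to_step(x0, k, alpha_bars, rng):
    """Sample x_k ~ N(sqrt(abar_k) * x0, (1 - abar_k) * I); k is 1-indexed."""
    if not 1 <= k <= len(alpha_bars):
        raise ValueError("k must lie in [1, total_steps]")
    ab = alpha_bars[k - 1]
    return [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for v in x0]


TOTAL_STEPS = 100    # hypothetical T
SHALLOW_STEP = 54    # hypothetical k < T
bars = alpha_bar(linear_betas(TOTAL_STEPS))

# Stand-in for a few mel-spectrogram values from a simple decoder.
decoder_prediction = [0.0, 1.0, -1.0]
x_k = diffuse_to_step(decoder_prediction, SHALLOW_STEP, bars, random.Random(0))
print(f"signal scale kept at step k: {math.sqrt(bars[SHALLOW_STEP - 1]):.3f}")
```

Starting the reverse process at step k instead of T means only k denoising iterations are needed, which is where the inference speed-up described above would come from.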

preprint, 2022, arXiv

Dust Extinction Law in Nearby Star-Resolved Galaxies. II. M33 Traced by Supergiants

The dust extinction curves toward individual sight lines in M33 are derived for the first time with a sample of reddened O-type and B-type supergiants obtained from the LGGS. The observed photometric data are obtained from the LGGS, PS1 Survey, UKIRT, PHATTER Survey, GALEX, Swift/UVOT and XMM-SUSS. We combine the intrinsic spectral energy distributions (SEDs) obtained from the ATLAS9 and Tlusty stellar model atmospheres extinguished by the model extinction curves from the silicate-graphite dust model to construct model SEDs. The extinction tracers are distributed along the arms in M33, and the derived extinction curves cover a wide range of shapes ($R_V \approx 2-6$), indicating the complexity of the interstellar environment and the inhomogeneous distribution of interstellar dust in M33. The average extinction curve with $R_V \approx 3.39$ and dust size distribution $dn/da \sim a^{-3.45}{\rm exp}(-a/0.25)$ is similar to that of the MW but with a weaker 2175 Å bump and a slightly steeper rise in the far-UV band. The extinction in the $V$ band of M33 is up to 2 mag, with a median value of $A_V \approx 0.43$ mag. The multiband extinction values from the UV to IR bands are also predicted for M33, which will provide extinction corrections for future works. The method adopted in this work is also applied to other star-resolved galaxies (NGC 6822 and WLM), but only a few extinction curves can be derived because of the limited observations.

preprint, 2022, arXiv

Expressivity of Emergent Language is a Trade-off between Contextual Complexity and Unpredictability

Researchers are using deep learning models to explore the emergence of language in various language games, where agents interact and develop an emergent language to solve tasks. We focus on the factors that determine the expressivity of emergent languages, which reflects the amount of information about input spaces those languages are capable of encoding. We measure the expressivity of emergent languages based on the generalisation performance across different games, and demonstrate that the expressivity of emergent languages is a trade-off between the complexity and unpredictability of the context those languages emerged from. Another contribution of this work is the discovery of message type collapse, i.e. the number of unique messages is lower than that of inputs. We also show that using the contrastive loss proposed by Chen et al. (2020) can alleviate this problem.

preprint, 2022, arXiv

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the cost of their inherent iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the sampling steps without sacrificing the generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.

preprint, 2022, arXiv

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.

preprint, 2022, arXiv

Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning

Multiple-choice VQA has drawn increasing attention from researchers and end-users recently. As the demand for automatically constructing large-scale multiple-choice VQA data grows, we introduce a novel task called textual Distractors Generation for VQA (DG-VQA) focusing on generating challenging yet meaningful distractors given the context image, question, and correct answer. The DG-VQA task aims at generating distractors without ground-truth training samples since such resources are rarely available. To tackle DG-VQA in an unsupervised manner, we propose Gobbet, a reinforcement learning (RL) based framework that utilizes pre-trained VQA models as an alternative knowledge base to guide the distractor generation process. In Gobbet, a pre-trained VQA model serves as the environment in the RL setting to provide feedback for the input multi-modal query, while a neural distractor generator serves as the agent to take actions accordingly. We propose to use existing VQA models' performance degradation as indicators of the quality of generated distractors. On the other hand, we show the utility of generated distractors through data augmentation experiments, since robustness is more and more important when AI models are applied to unpredictable open-domain scenarios or security-sensitive applications. We further conduct a manual case study on the factors that explain why distractors generated by Gobbet can fool existing models.

preprint, 2022, arXiv

Improving Item Cold-start Recommendation via Model-agnostic Conditional Variational Autoencoder

Embedding & MLP has become a paradigm for modern large-scale recommendation systems. However, this paradigm suffers from the cold-start problem, which seriously compromises the ecological health of recommendation systems. This paper attempts to tackle the item cold-start problem by generating enhanced warmed-up ID embeddings for cold items with historical data and limited interaction records. From the perspective of industrial practice, we mainly focus on the following three points of item cold-start: 1) How to conduct cold-start without additional data requirements and make the strategy easy to deploy in online recommendation scenarios. 2) How to leverage both historical records and constantly emerging interaction data of new items. 3) How to model the relationship between item ID and side information stably from interaction data. To address these problems, we propose a model-agnostic Conditional Variational Autoencoder based Recommendation (CVAR) framework with advantages including compatibility with various backbones, no extra data requirements, and utilization of both historical data and recently emerging interactions. CVAR uses latent variables to learn a distribution over item side information and generates desirable item ID embeddings using a conditional decoder. The proposed method is evaluated by extensive offline experiments on public datasets and online A/B tests on the Tencent News recommendation platform, which further illustrate the advantages and robustness of CVAR.

preprint, 2022, arXiv

Kronecker-factored Quasi-Newton Methods for Deep Learning

Second-order methods have the capability of accelerating optimization by using much richer curvature information than first-order methods. However, most are impractical for deep learning, where the number of training parameters is huge. In Goldfarb et al. (2020), practical quasi-Newton methods were proposed that approximate the Hessian of a multilayer perceptron (MLP) model by a layer-wise block diagonal matrix where each layer's block is further approximated by a Kronecker product corresponding to the structure of the Hessian restricted to that layer. Here, we extend these methods to enable them to be applied to convolutional neural networks (CNNs), by analyzing the Kronecker-factored structure of the Hessian matrix of convolutional layers. Several improvements to the methods in Goldfarb et al. (2020) are also proposed that can be applied to both MLPs and CNNs. These new methods have memory requirements comparable to first-order methods and much less per-iteration time complexity than those in Goldfarb et al. (2020). Moreover, convergence results are proved for a variant under relatively mild conditions. Finally, we compared the performance of our new methods against several state-of-the-art (SOTA) methods on MLP autoencoder and CNN problems, and found that they outperformed the first-order SOTA methods and performed comparably to the second-order SOTA methods.

preprint, 2022, arXiv

Learning the Beauty in Songs: Neural Singing Voice Beautifier

We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation but ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction: Shape-Aware Dynamic Time Warping (SADTW), which ameliorates the robustness of existing time-warping approaches, to synchronize the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. To achieve this, we also propose a new dataset containing parallel singing recordings of both amateur and professional versions. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics. Audio samples are available at https://neuralsvb.github.io. Code: https://github.com/MoonInTheRiver/NeuralSVB.

preprint, 2022, arXiv

MIC: Model-agnostic Integrated Cross-channel Recommenders

Semantically connecting users and items is a fundamental problem for the matching stage of an industrial recommender system. Recent advances in this topic are based on multi-channel retrieval to efficiently measure users' interest in items from the massive candidate pool. However, existing works are primarily built upon pre-defined retrieval channels, including User-CF (U2U), Item-CF (I2I), and Embedding-based Retrieval (U2I), and thus access only the limited correlation between users and items that follows from partial information of latent interactions. In this paper, we propose a model-agnostic integrated cross-channel (MIC) approach for large-scale recommendation, which maximally leverages the inherent multi-channel mutual information to enhance the matching performance. Specifically, MIC robustly models correlation within user-item, user-user, and item-item from latent interactions in a universal schema. For each channel, MIC naturally aligns pairs with semantic similarity and distinguishes them otherwise with more uniform anisotropic representation space. While state-of-the-art methods require specific architectural design, MIC intuitively considers them as a whole by enabling the complete information flow among users and items. Thus MIC can be easily plugged into other retrieval recommender systems. Extensive experiments show that our MIC helps several state-of-the-art models boost their performance on two real-world benchmarks. The satisfactory deployment of the proposed MIC on industrial online services empirically proves its scalability and flexibility.

preprint, 2022, arXiv

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Multi-speaker singing voice synthesis is to generate the singing voice sung by different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker with a fixed-size embedding from single reference audio. However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture full details of the target timbre; 2) single reference audio does not contain sufficient timbre information of the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice. In this paper, we propose a new model called MR-SVS to tackle these problems. Specifically, we employ both a multi-reference encoder and a fixed-size encoder to encode the timbre of the target speaker from multiple reference audios. The Multi-reference encoder can capture more details and variations of the target timbre. Besides, we propose a well-designed pitch shift method to address the pitch inconsistency problem. Experiments indicate that our method outperforms the baseline method both in naturalness and similarity.

preprint, 2022, arXiv

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture. 2) To further compress the model size and memory footprint, we introduce the grouped parameter sharing mechanism to the affine coupling layers in the post-net. 3) To improve the expressiveness of synthesized speech and reduce the dependency on accurate fine-grained alignment between text and speech, we propose a linguistic encoder with mixture alignment combining hard inter-word alignment and soft intra-word alignment, which explicitly extracts word-level semantic information. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective.

preprint, 2022, arXiv

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the cost of their inherent iterative sampling process hinders their application to text-to-speech deployment. Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work estimating the gradient for data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target site via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while it maintains sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/.
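The abstract describes distilling an N-step teacher into an N/2-step student and reports sampling with only 2 iterations. One way to picture this, under the assumption that the halving is applied repeatedly, is as a simple step schedule. The function name and the starting step count below are hypothetical and are not taken from the paper.

```python
def halving_schedule(teacher_steps, target_steps=2):
    """List the step counts visited when each student halves its teacher's steps.

    Assumes repeated N -> N/2 distillation until target_steps is reached.
    """
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need teacher_steps >= target_steps >= 1")
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        if schedule[-1] % 2 != 0:
            raise ValueError("step count is not divisible by 2")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != target_steps:
        raise ValueError("target_steps is not reachable by halving")
    return schedule


# Hypothetical starting point: an 8-step teacher.
steps = halving_schedule(8)
rounds = len(steps) - 1
print(steps, "distillation rounds:", rounds)
```

Each round halves the number of sampling iterations, so the number of rounds grows only logarithmically with the teacher's step count.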

preprint2022arXiv

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which hurt the prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) are dependent on each other and produce the natural prosody together; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances the prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given the word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.

preprint2022arXiv

Revisiting Over-Smoothness in Text to Speech

Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods reducing the complexity of data distributions. Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods. Based on these studies, we find that 1) methods that provide additional condition inputs reduce the complexity of data distributions to model, thus alleviating the over-smoothing problem and achieving better voice quality. 2) Among advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and enjoys its simplicity, while GAN and Glow achieve the best voice quality while suffering from increased training or model complexity. 3) The two categories of methods can be combined to further alleviate the over-smoothness and improve the voice quality. 4) Our experiments on the multi-speaker dataset lead to similar conclusions as above and providing more variance information can reduce the difficulty of modeling the target data distribution and alleviate the requirements for model capacity.

preprint2022arXiv

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

Deep generative models have achieved significant progress in speech synthesis to date, while high-fidelity singing voice synthesis is still an open problem because of its long continuous pronunciation, rich high-frequency parts, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot directly be applied to singing voice synthesis because they result in glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and promote high-frequency reconstruction; and 3) to improve the training efficiency, SingGAN includes auxiliary spectrogram losses and a sub-band feature matching penalty loss. To the best of our knowledge, SingGAN is the first work designed toward high-fidelity singing voice vocoding. Our evaluation of SingGAN demonstrates state-of-the-art results with higher-quality (MOS 4.05) samples. Also, SingGAN enables sampling 50x faster than real-time on a single NVIDIA 2080Ti GPU. We further show that SingGAN generalizes well to the mel-spectrogram inversion of unseen singers, and the end-to-end singing voice synthesis system SingGAN-SVS enjoys a two-stage pipeline to transform the music scores into expressive singing voices. Audio samples are available at https://SingGAN.github.io/

preprint2022arXiv

SSR-GNNs: Stroke-based Sketch Representation with Graph Neural Networks

This paper follows cognitive studies to investigate a graph representation for sketches, where the information of strokes, i.e., parts of a sketch, is encoded on vertices and inter-stroke information on edges. The resultant graph representation facilitates the training of a graph neural network for classification tasks, and achieves accuracy and robustness comparable to the state-of-the-art against translation and rotation attacks, as well as stronger attacks on graph vertices and topologies, i.e., modifications and addition of strokes, all without resorting to adversarial training. Prior studies on sketches, e.g., graph transformers, encode control points of strokes on vertices, which are not invariant to spatial transformations. In contrast, we encode vertices and edges using pairwise distances among control points to achieve invariance. Compared with the existing generative sketch model for one-shot classification, our method does not rely on run-time statistical inference. Lastly, the proposed representation enables generation of novel sketches that are structurally similar to, while separable from, the existing dataset.
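The invariance argument in this abstract rests on a general geometric fact: pairwise Euclidean distances between points do not change under rotation and translation, whereas raw coordinates do. The sketch below checks this numerically; the control points, angle, and shift are hypothetical values, and the helper names are not from the paper.

```python
import math


def pairwise_distances(points):
    """Return the matrix of Euclidean distances between 2D control points."""
    return [[math.dist(p, q) for q in points] for p in points]


def rigid_transform(points, angle, shift):
    """Rotate points by `angle` (radians) about the origin, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]


stroke = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]  # hypothetical control points
moved = rigid_transform(stroke, angle=0.7, shift=(3.0, -1.0))

before = pairwise_distances(stroke)
after = pairwise_distances(moved)

# Coordinates change, but every pairwise distance agrees up to rounding error.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(before, after)
    for a, b in zip(row_a, row_b)
)
print(same)
```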

preprint2022arXiv

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

The recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually use a phoneme sequence as input and thus cannot understand the tree-structured syntactic information of the input sequence, which hurts the prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and light-weight NAR-TTS model, which integrates tree-structured syntactic information into the prosody modeling modules in PortaSpeech (Ren et al., 2021). Specifically, 1) we build a syntactic graph based on the dependency tree of the input sentence, then process the text encoding with a syntactic graph encoder to extract the syntactic information. 2) We incorporate the extracted syntactic encoding with PortaSpeech to improve the prosody prediction. 3) We introduce a multi-length discriminator to replace the flow-based post-net in PortaSpeech, which simplifies the training pipeline and improves the inference speed, while keeping the naturalness of the generated audio. Experiments on three datasets not only show that the tree-structured syntactic information grants SyntaSpeech the ability to synthesize better audio with expressive prosody, but also demonstrate the generalization ability of SyntaSpeech to adapt to multiple languages and multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. Source code and audio samples are available at https://syntaspeech.github.io

preprint2022arXiv

Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns

Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks, raising concerns about the application of such models to risk-sensitive tasks such as autonomous driving. Threat models for these demonstrations are limited to (1) targeted attacks through real-time manipulation of the agent's observation, and (2) untargeted attacks through manipulation of the physical environment. The former assumes full access to the agent's states/observations at all times, while the latter has no control over attack outcomes. This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical objects in the environment, a threat model that combines the practicality and effectiveness of the existing ones. Through analysis, we demonstrate that a pre-trained policy can be hijacked within a time window, e.g., performing an unintended self-parking, when an adversarial object is present. To enable the attack, we adopt an assumption that the dynamics of both the environment and the agent can be learned by the attacker. Lastly, we empirically show the effectiveness of the proposed attack on different driving scenarios, perform a location robustness test, and study the tradeoff between the attack strength and its effectiveness. Code is available at https://github.com/ASU-APG/Targeted-Physical-Adversarial-Attacks-on-AD

preprint2022arXiv

Video-based Facial Micro-Expression Analysis: A Survey of Datasets, Features and Algorithms

Unlike the conventional facial expressions, micro-expressions are involuntary and transient facial expressions capable of revealing the genuine emotions that people attempt to hide. Therefore, they can provide important information in a broad range of applications such as lie detection, criminal detection, etc. Since micro-expressions are transient and of low intensity, however, their detection and recognition is difficult and relies heavily on expert experiences. Due to its intrinsic particularity and complexity, video-based micro-expression analysis is attractive but challenging, and has recently become an active area of research. Although there have been numerous developments in this area, thus far there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences between macro- and micro-expressions, then use these differences to guide our research survey of video-based micro-expression analysis in a cascaded structure, encompassing the neuropsychological basis, datasets, features, spotting algorithms, recognition algorithms, applications and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are addressed and discussed. Furthermore, after considering the limitations of existing micro-expression datasets, we present and release a new dataset - called micro-and-macro expression warehouse (MMEW) - containing more video samples and more labeled emotion types. We then perform a unified comparison of representative methods on CAS(ME)2 for spotting, and on MMEW and SAMM for recognition, respectively. Finally, some potential future research directions are explored and outlined.

preprint2022arXiv

Video-Guided Curriculum Learning for Spoken Video Grounding

In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noises to this speech audio, further increasing the difficulty of this task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) scheme for the audio pre-training process, which makes use of vital visual perceptions to help understand the spoken language and suppress the external noise. Considering that the model cannot obtain ground-truth video segments during inference, we design a curriculum strategy that gradually shifts the input video from the ground truth to the entire video content during pre-training. In the end, the model learns how to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, named the ActivityNet Speech dataset. Extensive experiments demonstrate that our proposed video-guided curriculum learning can facilitate the pre-training process to obtain a mutual audio encoder, significantly promoting the performance of spoken video grounding tasks. Moreover, we show that in the case of noisy sound, our model outperforms the method that grounds videos with ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.

preprint2021arXiv

Dust Extinction Law in Nearby Star-Resolved Galaxies. I. M31 Traced by Supergiants

The dust extinction laws and dust properties in M31 are explored with a sample of reddened O-type and B-type supergiants obtained from the LGGS. The observed spectral energy distributions (SEDs) for each tracer are constructed with multiband photometry from the LGGS, PS1 Survey, UKIRT, PHAT Survey, Swift/UVOT and XMM-SUSS. We model the SED for each tracer in combination with the intrinsic spectrum obtained from the stellar model atmosphere extinguished by the model extinction curves. Instead of mathematically parameterizing the extinction functions, the model extinction curves in this work are directly derived from the silicate-graphite dust model with a dust size distribution of $dn/da \sim a^{-\alpha}\exp(-a/0.25),~0.005 < a < 5~\mu{\rm m}$. The extinction tracers are distributed along the arms in M31, with the derived MW-type extinction curves covering a wide range of $R_V$ ($\approx 2 - 6$), indicating the complexity of the interstellar environment and the inhomogeneous distribution of interstellar dust in M31. The average extinction curve with $R_V \approx 3.51$ and dust size distribution $dn/da \sim a^{-3.35}\exp(-a/0.25)$ is similar to those of the MW but rises slightly less steeply in the far-UV bands, implying that the overall interstellar environment in M31 resembles the diffuse region in the MW. The extinction in the $V$ band of M31 is up to 3 mag, with a median value of $A_V \approx 1$ mag. The multiband extinction values from the UV to IR bands are also predicted for M31, which will provide a general extinction correction for future works.
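The size distribution quoted in this abstract, a power law with an exponential cutoff, can be evaluated directly. The following is a minimal sketch that assumes grain radii in micrometres and leaves the distribution unnormalized; the function name `size_distribution` and the parameter name `a_c` for the 0.25 cutoff scale are illustrative choices, not notation from the paper.

```python
import math


def size_distribution(a_um: float, alpha: float, a_c: float = 0.25) -> float:
    """Unnormalized dn/da ~ a^(-alpha) * exp(-a / a_c) for grain radius a (micrometres)."""
    if not 0.005 < a_um < 5.0:
        return 0.0  # outside the size range quoted in the abstract
    return a_um ** (-alpha) * math.exp(-a_um / a_c)


# Relative abundance of 0.01 um grains versus 0.1 um grains for the
# average index alpha = 3.35 quoted in the abstract.
ratio = size_distribution(0.01, 3.35) / size_distribution(0.1, 3.35)
print(f"small-to-large ratio: {ratio:.1f}")
```

The steep power law means small grains dominate by number, while the exponential term suppresses grains much larger than the cutoff scale.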

preprint2021arXiv

Edge and sublayer degrees of freedom for phosphorene nanoribbons with twofold-degenerate edge bands via electric field

For pristine phosphorene nanoribbons (PNRs) with edge states, there exist two categories of edge bands near the Fermi energy (EF): the shuttle-shaped twofold-degenerate edge bands and the near-flat simple degenerate edge bands. However, the usual experimental measurements may not distinguish between the two categories of edge bands. Here we study how the edge bands of PNRs vary under an external electrostatic field. Using the KWANT code based on the tight-binding approach, we find that under a sufficiently strong in-plane electric field the twofold-degenerate edge bands can be divided into two separate shuttles until the degeneracy is completely removed and a gap near EF is opened. Importantly, each shuttle is identified, according to the local density of states, as coming from the outermost atoms of the upper or lower ribbon edge. Under a small off-plane field, however, the shuttle-shaped bands are easily turned into two near-flat bands contributed by the edge atoms of the top and bottom sublayers, respectively. This evidence provides the edge and sublayer degrees of freedom (DOFs) for PNRs with shuttle-shaped edge bands, which are clearly different from the other category of PNRs that intrinsically have near-flat edge bands. The reason is that ribbons of the former category solely have four zigzag-like atomic configurations at the edges in each unit cell, which also makes the property robust against point defects in the central area of the ribbon. Furthermore, as an application, we propose a homogeneous junction of a shuttle-edge-band PNR with two attached electric gates. Interestingly, the transport property of the junction under field manipulation clearly reflects the characteristics of the two DOFs. These findings may provide a further understanding of PNRs and initiate new developments in PNR-based electronics.

preprint2021arXiv

Practical Quasi-Newton Methods for Training Deep Neural Networks

We consider the development of practical stochastic quasi-Newton methods, in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded both above and below. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
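Why a Kronecker-factored block is cheap to work with follows from a standard identity: $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top})$, where $\mathrm{vec}$ stacks columns. The product with the large matrix can therefore be computed from the two small factors without ever forming it. The sketch below verifies the identity in plain Python on small integer matrices; it illustrates the general identity only, not the paper's BFGS updates, and the matrices are arbitrary example values.

```python
def matmul(P, Q):
    """Dense matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def vec(X):
    """Stack the columns of X into one flat list (column-major order)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


A = [[2, 1], [1, 3]]                   # small 2 x 2 factor
B = [[1, 0, 2], [0, 1, 0], [2, 0, 5]]  # small 3 x 3 factor
X = [[1, 2], [3, 4], [5, 6]]           # 3 x 2 matrix, e.g. a reshaped vector

# Expensive route: build the 6 x 6 Kronecker product explicitly.
big = kron(A, B)
dense = [sum(w * v for w, v in zip(row, vec(X))) for row in big]

# Cheap route: only multiply by the small factors.
cheap = vec(matmul(matmul(B, X), transpose(A)))

print(dense == cheap)
```

For a layer whose block would have $n^2$ entries, the cheap route needs only the two factor matrices, which is the storage saving the abstract refers to.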

preprint2021arXiv

The Dust Mass of Supernova Remnants in M31

The dust temperature and mass of the supernova remnants (SNRs) in M31 are estimated by fitting the infrared spectral energy distribution calculated from the images in the Spitzer/IRAC4 and MIPS24, Herschel/PACS70, 100, 160, and Herschel/SPIRE250, 350 $\mu$m bands. Twenty SNRs with relatively reliable photometry exhibit an average dust temperature of $20.1^{+1.8}_{-1.5}$ K, which is higher than that of the surroundings and indicates the heating effect of the supernova explosion. The dust mass of these SNRs ranges from about 100 to 800 $M_{\odot}$, much larger than that of the SNRs in the Milky Way. On the other hand, this yields a dust surface density of $0.10^{+0.07}_{-0.04}\,M_{\odot}\,{\rm pc^{-2}}$, about half that of the surrounding area, which implies that about half of the dust in the SNRs is destroyed by the supernova explosion. The dust temperature, the radius, and thus the dust mass all demonstrate that the studied SNRs are old and very likely in the snowplow or even fade-away phase, owing to the limits imposed by the large distance of M31 and the observational resolution, and the results can serve as a reference for the final effect of a supernova explosion on the surrounding dust.

preprint2021arXiv

The Sample of Red Supergiants in Twelve Low-Mass Galaxies of the Local Group

This work establishes the most complete sample of red supergiants (RSGs) in twelve low-mass galaxies (WLM, IC 10, NGC 147, NGC 185, IC 1613, Leo A, Sextans B, Sextans A, NGC 6822, Pegasus Dwarf, SMC and LMC) of the Local Group, which forms the solid basis to study the properties of RSGs as well as the star formation rate (SFR) and initial mass function (IMF) of the galaxies. After removing the foreground dwarf stars by their obvious branch in the near-infrared color-color diagram ($(J-H)_0/(H-K)_0$) with the UKIRT/WFCAM and 2MASS photometry as well as the Gaia/EDR3 measurements of proper motion and parallax, RSGs are identified from their location in the color-magnitude diagram $(J-K)_{0}/K_{0}$ of the member stars of the specific galaxy. A total of 2,190 RSGs are found in ten dwarf galaxies, and additionally 4,823 and 2,138 RSGs in the LMC and SMC respectively. The locations of the tip of the red giant branch in the $(J-K)_{0}/K_{0}$ diagram are determined to serve as an indicator of the metallicity and distance modulus of the galaxies.

preprint2020arXiv

A Study of Non-autoregressive Model for Sequence Generation

Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation speed compared to their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques including knowledge distillation and source-target alignment have been proposed to bridge the gap between AR and NAR models in various tasks such as neural machine translation (NMT), automatic speech recognition (ASR), and text to speech (TTS). With the help of those techniques, NAR models can catch up with the accuracy of AR models in some tasks but not in others. In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why can NAR models catch up with AR models in some tasks but not all? (2) Why do techniques like knowledge distillation and source-target alignment help NAR models? Since the main difference between AR and NAR models is that NAR models do not use dependency among target tokens while AR models do, intuitively the difficulty of NAR sequence generation heavily depends on the strength of dependency among target tokens. To quantify such dependency, we propose an analysis model called CoMMA to characterize the difficulty of different NAR sequence generation tasks. We have several interesting findings: 1) Among the NMT, ASR and TTS tasks, ASR has the most target-token dependency while TTS has the least. 2) Knowledge distillation reduces the target-token dependency in the target sequence and thus improves the accuracy of NAR models. 3) The source-target alignment constraint encourages dependency of a target token on source tokens and thus eases the training of NAR models.
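The difference between the two model families that this abstract builds on can be written as two factorizations of the same conditional distribution. The notation here ($x$ for the source, $y = (y_1, \dots, y_T)$ for the target, $T$ for the target length) is an assumption chosen for illustration, not the paper's own; many NAR models also predict $T$ separately, which is omitted.

```latex
% Autoregressive: each target token conditions on all previous target tokens.
P_{\mathrm{AR}}(y \mid x) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, x\right)

% Non-autoregressive: target tokens are conditionally independent given the source.
P_{\mathrm{NAR}}(y \mid x) = \prod_{t=1}^{T} P\left(y_t \mid x\right)
```

The NAR form drops the $y_{<t}$ term, which is what allows parallel generation and also why strong dependency among target tokens makes the task harder.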

preprint2020arXiv

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired samples and extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text $y$ into speech $\hat{x}$, and the ASR model leverages the transformed pair $(\hat{x},y)$ for training, and vice versa, to boost the accuracy of the two tasks; (3) bidirectional sequence modeling, which addresses error propagation, especially in long speech and text sequences, when training with few paired data; (4) a unified model structure, which combines all the above components for TTS and ASR based on the Transformer model. Our method achieves a 99.84% word-level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR on the LJSpeech dataset, by leveraging only 200 paired speech and text samples (about 20 minutes of audio), together with extra unpaired speech and text data.

preprint2020arXiv

Compositional Languages Emerge in a Neural Iterated Learning Model

The principle of compositionality, which enables natural language to represent complex concepts via a structured combination of simpler ones, allows us to convey an open-ended set of messages using a limited vocabulary. If compositionality is indeed a natural property of language, we may expect it to appear in communication protocols that are created by neural agents in language games. In this paper, we propose an effective neural iterated learning (NIL) algorithm that, when applied to interacting neural agents, facilitates the emergence of a more structured type of language. Indeed, these languages provide learning speed advantages to neural agents during training, which can be incrementally amplified via NIL. We provide a probabilistic model of NIL and an explanation of why the advantage of compositional language exists. Our experiments confirm our analysis, and also demonstrate that the emerged languages largely improve the generalizing power of the neural agent communication.

preprint2020arXiv

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics starting from coarse-grained sentence level to fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human effort for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that with the singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. Audio samples are available at https://speechresearch.github.io/deepsinger/

preprint2020arXiv

Degrees of symmetric Grothendieck polynomials and Castelnuovo-Mumford regularity

We give an explicit formula for the degree of the Grothendieck polynomial of a Grassmannian permutation and a closely related formula for the Castelnuovo-Mumford regularity of the Schubert determinantal ideal of a Grassmannian permutation. We then provide a counterexample to a conjecture of Kummini-Lakshmibai-Sastry-Seshadri on a formula for regularities of standard open patches of particular Grassmannian Schubert varieties and show that our work gives rise to an alternate explicit formula in these cases. We end with a new conjecture on the regularities of standard open patches of arbitrary Grassmannian Schubert varieties.

preprint2020arXiv

Evolved Massive Stars at Low-metallicity II. Red Supergiant Stars in the Small Magellanic Cloud

We present the most comprehensive RSG sample for the SMC to date, including 1,239 RSG candidates. The initial sample is derived based on a source catalog for the SMC with conservative ranking. Additional spectroscopic RSGs are retrieved from the literature, as well as RSG candidates selected from the inspection of CMDs. We estimate that there are in total $\sim$ 1,800 or more RSGs in the SMC. We purify the sample by studying the infrared CMDs and the variability of the objects, though there is still an ambiguity between AGBs and RSGs. There are far fewer RSG candidates ($\sim$4\%) showing PAH emission features compared to the Milky Way and LMC ($\sim$15\%). The MIR variability of the RSG sample increases with luminosity. We separate the RSG sample into two subsamples ("risky" and "safe") and identify one M5e AGB star in the "risky" subsample. Most of the targets with large variability are also the bright ones with large MLR. Some targets show excessive dust emission, which may be related to previous episodic mass loss events. We also roughly estimate the total gas and dust budget produced by the entire RSG population as $\rm \sim1.9^{+2.4}_{-1.1}\times10^{-6}~M_{\odot}/yr$ in the most conservative case. Based on the MIST models, we derive a linear relation between $T_{\rm eff}$ and observed $\rm J-K_S$ color with reddening correction for the RSG sample. By using a constant bolometric correction and this relation, the Geneva evolutionary model is compared with our RSG sample, showing a good agreement and a lower initial mass limit of $\sim$7 $\rm M_\odot$ for the RSG population. Finally, we compare the RSG samples in the SMC and the LMC. Despite the incompleteness of the LMC sample at the faint end, the result indicates that the LMC sample always shows redder color (except for the $\rm IRAC1-IRAC2$ and $\rm WISE1-WISE2$ colors due to CO absorption) and larger variability than the SMC sample.

preprint2020arXiv

Evolved Massive Stars at Low-metallicity III. A Source Catalog for the Large Magellanic Cloud

We present a clean, magnitude-limited (IRAC1 or WISE1$\leq$15.0 mag) multiwavelength source catalog for the LMC. The catalog was built upon crossmatching ($1''$) and deblending ($3''$) between the SEIP source list and Gaia DR2, with strict constraints on the Gaia astrometric solution to remove the foreground contamination. The catalog contains 197,004 targets in 52 different bands including 2 ultraviolet, 21 optical, and 29 infrared bands. Additional information about radial velocities and spectral/photometric classifications were collected from the literature. The bright end of our sample is mostly comprised of blue helium-burning stars (BHeBs) and red HeBs with inevitable contamination of main sequence stars at the blue end. After applying modified magnitude and color cuts based on previous studies, we identify and rank 2,974 RSG, 508 YSG, and 4,786 BSG candidates in the LMC in six CMDs. The comparison between the CMDs of the LMC and SMC indicates that the most distinct difference appears at the bright red end of the optical and near-infrared CMDs, where the cool evolved stars (e.g., RSGs, AGB, and RGs) are located, which is likely due to the effect of metallicity and SFH. Further quantitative comparison of colors of massive star candidates in equal absolute magnitude bins suggests that, there is basically no difference for the BSG candidates, but large discrepancy for the RSG candidates as LMC targets are redder than the SMC ones, which may be due to the combined effect of metallicity on both spectral type and mass-loss rate, and also the age effect. The $T_{\rm eff}$ of massive star populations are also derived from reddening-free color of $(J-K_{\rm S})_0$. The $T_{\rm eff}$ ranges are $3500<T_{\rm eff}<5000$ K for RSG population, $5000<T_{\rm eff}<8000$ K for YSG population, and $T_{\rm eff}>8000$ K for BSG population, with larger uncertainties towards the hotter stars.
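The effective-temperature ranges listed at the end of this abstract amount to a simple lookup from $T_{\rm eff}$ to a supergiant class. The sketch below encodes only those quoted ranges; the function name `classify_by_teff` is hypothetical, and treating the exact boundary values (3500, 5000, 8000 K) as unclassified is an assumption made because the abstract states strict inequalities.

```python
def classify_by_teff(teff_k: float) -> str:
    """Map an effective temperature in kelvin to the class ranges quoted in the abstract."""
    if 3500 < teff_k < 5000:
        return "RSG"  # red supergiant
    if 5000 < teff_k < 8000:
        return "YSG"  # yellow supergiant
    if teff_k > 8000:
        return "BSG"  # blue supergiant
    # Below 3500 K or exactly on a boundary: not covered by the quoted ranges.
    return "unclassified"


for teff in (4000, 6500, 12000, 5000):
    print(teff, classify_by_teff(teff))
```

This ignores the magnitude and color cuts the catalog actually uses to select candidates, so it should be read as a summary of the temperature ranges rather than a selection procedure.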

preprint2020arXiv

Fully distributed cooperation for networked uncertain mobile manipulators

This paper investigates a fully distributed cooperation scheme for networked mobile manipulators. To achieve cooperative task allocation in a distributed way, an adaptation-based estimation law is established for each robotic agent to estimate the desired local trajectory. In addition, wrench synthesis is analyzed in detail to lay a solid foundation for tight cooperation tasks. Together with the estimated task, a set of distributed adaptive controllers is proposed to achieve motion synchronization of the mobile manipulator ensemble over a directed graph with a spanning tree irrespective of the kinematic and dynamic uncertainties in both the mobile manipulators and the tightly grasped object. The controlled synchronization alleviates the performance degradation caused by the estimation/tracking discrepancy during the transient phase. The proposed scheme requires no persistent excitation condition and avoids the use of noisy Cartesian-space velocities. Furthermore, it is independent from the object's center of mass by employing formation-based task allocation and a task-oriented strategy. These attractive attributes facilitate the practical application of the scheme. It is theoretically proven that convergence of the cooperative task tracking error is guaranteed. Simulation results validate the efficacy and demonstrate the expected performance of the proposed scheme.

preprint2020arXiv

Information Content of Hierarchical n-Point Polytope Functions for Quantifying and Reconstructing Disordered Systems

Disordered systems are ubiquitous in physical, biological and material sciences. Examples include liquid and glassy states of condensed matter, colloids, granular materials, porous media, composites, alloys, packings of cells in avian retina and tumor spheroids, to name but a few. A comprehensive understanding of such disordered systems requires, as the first step, systematic quantification, modeling and representation of the underlying complex configurations and microstructure, which is generally very challenging to achieve. Recently, we introduced a set of hierarchical statistical microstructural descriptors, i.e., the n-point polytope functions Pn, which are derived from the standard n-point correlation functions Sn, and successively include higher-order n-point statistics of the morphological features of interest in a concise, explainable, and expressive manner. Here we investigate the information content of the Pn functions via optimization-based realization rendering. This is achieved by successively incorporating higher order Pn functions up to n = 8 and quantitatively assessing the accuracy of the reconstructed systems via unconstrained statistical morphological descriptors (e.g., the lineal-path function). We examine a wide spectrum of representative random systems with distinct geometrical and topological features. We find that generally, successively incorporating higher order Pn functions, and thus, the higher-order morphological information encoded in these descriptors, leads to superior accuracy of the reconstructions. However, incorporating more Pn functions into the reconstruction also significantly increases the complexity and roughness of the associated energy landscape for the underlying stochastic optimization, making it difficult to converge numerically.

preprint2020arXiv

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of text and speech pairs for model training. However, there are more than 6,000 languages in the world and most languages lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfies the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses extremely low-resource training data. We also conduct comprehensive analyses on LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.

preprint2020arXiv

Microstructure Design of Low-Melting-Point Alloy (LMPA)/ Polymer Composites for Dynamic Dry Adhesion Tuning in Soft Gripping

Tunable dry adhesion is a crucial mechanism in compliant manipulation. The gripping force, which originates mainly from the van der Waals force between the adhesive composite and the object to be gripped, can be controlled by reversibly varying the physical properties (e.g., stiffness) of the composite via external stimuli. The maximal gripping force Fmax and its tunability depend on, among other factors, the stress distribution on the gripping interface and its fracture dynamics (during detaching), which in turn are determined by the composite microstructure. Here, we present a computational framework for the modeling and design of a class of binary smart composites containing a porous low-melting-point alloy (LMPA) phase and a polymer phase, in order to achieve desirable dynamically tunable dry adhesion. In particular, we employ spatial correlation functions to quantify, model and represent the complex bi-continuous microstructure of the composites, from which a wide spectrum of realistic virtual 3D composite microstructures can be generated using stochastic optimization. A recently developed volume-compensated lattice-particle (VCLP) method is then employed to model the dynamic interfacial fracture process to compute Fmax for different composite microstructures. We focus on the interface defect tuning (IDT) mechanism for dry adhesion tuning enabled by the composite, in which the thermal expansion of the LMPA phase due to Joule heating initializes small cracks on the adhesion interface, subsequently causing the detachment of the gripper from the object due to interfacial fracture. We find that for an optimal microstructure among the ones studied here, a 10-fold dynamic tuning of Fmax before and after the thermal expansion of the LMPA phase can be achieved. Our computational results can provide valuable guidance for experimental fabrication of the LMPA-polymer composites.

preprint2020arXiv

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) models (e.g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron~\cite{shen2018natural}) due to their parallel computation in training and/or inference. However, the parallel computation increases the difficulty of learning the alignment between text and speech in Transformer, which is further magnified in the multi-speaker scenario with noisy data and diverse speakers, and hinders the applicability of Transformer for multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on phoneme embedding in encoder to better preserve position information; 3) a bottleneck in decoder pre-net to prevent copy between consecutive speech frames. Experiments on VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better quality multi-speaker voice than naive Transformer based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.

preprint2020arXiv

On Granulation and Irregular Variation of Red Supergiants

The mechanism and characteristics of the irregular variations of red supergiants (RSGs) are studied based on the RSG samples in the Small Magellanic Cloud (SMC), the Large Magellanic Cloud (LMC) and M31. With the time-series data from the All-Sky Automated Survey for SuperNovae (ASAS-SN) and the Intermediate Palomar Transient Factory survey, we use the continuous time autoregressive moving average model to estimate the variability features of the light curves and their power spectral density. The characteristic evolution timescale and amplitude of granulations are further derived from fitting the posterior power spectral density with the COR function, which is a Harvey-like granulation model. The consistency of theoretical predictions and results is checked to verify the correctness of the assumption that granulations on RSGs contribute to irregular variation. The relations between granulation and stellar parameters are obtained and compared with the results of red giant branch stars and Betelgeuse. It is found that the relations are in agreement with predictions from the basic physical process of granulation and fall close to the extrapolated relations of RGB stars. The granulations in most of the RSGs evolve at a timescale of several days to a year with a characteristic amplitude of 10-1000 mmag. The results imply that the irregular variations of RSGs can be attributed to the evolution of granulations. When comparing the results from the SMC, LMC and M31, the timescale and amplitude of granulation seem to increase with metallicity. The analytical relations of the granulation parameters with stellar parameters are derived for the RSG sample of each galaxy.

preprint2020arXiv

PopMAG: Pop Music Accompaniment Generation

In pop music, accompaniments are usually played by multiple instruments (tracks) such as drum, bass, string and guitar, and can make a song more expressive and contagious by arranging together with its melody. Previous works usually generate multiple tracks separately and the music notes from different tracks do not explicitly depend on each other, which hurts the harmony modeling. To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks. While this greatly improves harmony, unfortunately, it enlarges the sequence length and brings the new challenge of long-term music modeling. We further introduce two new techniques to address this challenge: 1) We model multiple note attributes (e.g., pitch, duration, velocity) of a musical note in one step instead of multiple steps, which can shorten the length of a MuMIDI sequence. 2) We introduce extra long-context as memory to capture long-term dependency in music. We call our system for pop music accompaniment generation PopMAG. We evaluate PopMAG on multiple datasets (LMD, FreeMidi and CPMD, a private dataset of Chinese pop songs) with both subjective and objective metrics. The results demonstrate the effectiveness of PopMAG for multi-track harmony modeling and long-term context modeling. Specifically, PopMAG wins 42\%/38\%/40\% votes when compared with ground truth musical pieces on LMD, FreeMidi and CPMD datasets respectively and largely outperforms other state-of-the-art music accompaniment generation models and multi-track MIDI representations in terms of subjective and objective metrics.

preprint2020arXiv

Red Supergiants in M31 and M33 I. The Complete Sample

The aim of this paper is to establish a complete sample of red supergiants (RSGs) in M31 and M33. The member stars of the two galaxies are selected from the near-infrared (NIR) point sources after removing the foreground dwarfs from their obvious branch in the $J-H/H-K$ diagram with the archival photometric data taken by the UKIRT/WFCAM. This separation by NIR colors of dwarfs from giants is confirmed by the optical/infrared color-color diagrams ($r-z/z-H$ and $B-V/V-R$), and the Gaia measurement of parallax and proper motion. The RSGs are then identified by their outstanding location in the members' $J-K/K$ diagram due to high luminosity and low effective temperature. The resultant sample has 5,498 and 3,055 RSGs in M31 and M33 respectively, which should be complete because the lower limiting $K$ magnitude of RSGs in both cases is brighter than the complete magnitude of the UKIRT photometry. Analysis of the control fields finds that the pollution rate in the RSGs sample is less than 1\%. The by-product is the complete sample of oxygen-rich asymptotic giant branch stars (AGBs), carbon-rich AGBs, thermally pulsing AGBs and extreme AGBs. In addition, the tip-RGB is determined together with its implication on the distance modulus to M31 and M33.

preprint2020arXiv

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.

preprint2020arXiv

Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation

Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT). Since AT and NAT can share model structure and AT is an easier task than NAT due to the explicit dependency on previous target-side tokens, a natural idea is to gradually shift the model training from the easier AT task to the harder NAT task. To smooth the shift from AT training to NAT training, in this paper, we introduce semi-autoregressive translation (SAT) as intermediate tasks. SAT contains a hyperparameter k, and each k value defines a SAT task with a different degree of parallelism. Specifically, SAT covers AT and NAT as its special cases: it reduces to AT when k = 1 and to NAT when k = N (N is the length of the target sentence). We design curriculum schedules to gradually shift k from 1 to N, with different pacing functions and numbers of tasks trained at the same time. We call our method task-level curriculum learning for NAT (TCL-NAT). Experiments on IWSLT14 De-En, IWSLT16 En-De, WMT14 En-De and De-En datasets show that TCL-NAT achieves significant accuracy improvements over previous NAT baselines and reduces the performance gap between NAT and AT models to 1-2 BLEU points, demonstrating the effectiveness of our proposed method.
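
The idea of shifting k from 1 to N over training can be sketched with a toy pacing function. This is a minimal illustration under stated assumptions, not the schedule used in the paper: the candidate k values (powers of two, then N), the linear pacing, and the names `candidate_ks` and `linear_pacing` are all hypothetical, and `total_steps` is assumed to be positive.

```python
from typing import List


def candidate_ks(n: int) -> List[int]:
    """Return assumed candidate k values: powers of two below n, then n itself."""
    ks = []
    k = 1
    while k < n:
        ks.append(k)
        k *= 2
    ks.append(n)  # k = n corresponds to the fully non-autoregressive task
    return ks


def linear_pacing(step: int, total_steps: int, n: int) -> int:
    """Pick k for the current training step by moving linearly through the candidates.

    Early steps return k = 1 (the AT task); the final steps return k = n (the NAT task).
    """
    ks = candidate_ks(n)
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    index = min(int(progress * len(ks)), len(ks) - 1)
    return ks[index]
```

With a target length of 16, the candidates are 1, 2, 4, 8 and 16, so training starts on the AT task and ends on the NAT task. Other pacing shapes (for example, ones that spend longer on small k) would only change how `progress` is mapped to an index.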

preprint2016arXiv

Design and Analysis of Deadline and Budget Constrained Autoscaling (DBCA) Algorithm for 5G Mobile Networks

In the cloud computing paradigm, virtual resource autoscaling approaches have been intensively studied in recent years. These approaches dynamically scale in/out virtual resources to adjust system performance and save operation cost. However, designing an autoscaling algorithm that achieves the desired performance with a limited budget, while considering the existing capacity of legacy network equipment, is not a trivial task. In this paper, we propose a Deadline and Budget Constrained Autoscaling (DBCA) algorithm for addressing the budget-performance tradeoff. We develop an analytical model to quantify the tradeoff and cross-validate the model by extensive simulations. The results show that the DBCA can significantly improve system performance given the budget upper-bound. In addition, the model provides a quick way to evaluate the budget-performance tradeoff and system design without wide deployment, saving on cost and time.

preprint2016arXiv

Design and Analysis of Optimal Threshold Offloading (OTO) Algorithm for LTE Femtocell/Macrocell Networks

LTE femtocells have been widely deployed to increase network capacity and to offload mobile data traffic from macrocells. When cellular users' mobility behaviors are taken into consideration, a dilemma arises: should a User Equipment (UE) hand over to a femtocell or keep its current connection with a macrocell? Indeed, various user mobility behaviors may incur significant signaling overhead and degrade femtocell offloading capability due to frequent handovers into and out of femtocells. To address this dilemma, in this paper we propose an Optimal Threshold Offloading (OTO) algorithm considering the tradeoff between the signaling overhead and femtocell offloading capability. We develop an analytical model and define two performance metrics to quantify the tradeoff. The proposed model not only models user mobility behaviors but also captures femtocell offloading benefits, and sheds light on their fundamental relationship. The analytical model and the simulation model are cross-validated by extensive ns2 simulations. Both analytical and simulation results demonstrate that the OTO algorithm can significantly reduce signaling overhead at a minor cost in femtocell offloading capability. The results enable wide applicability in various scenarios, and therefore, have important theoretical significance. Moreover, the analytical results provide a quick way to evaluate signaling overhead and offloading capability in LTE networks without wide deployment, saving on cost and time.

preprint2016arXiv

Example-Based Image Synthesis via Randomized Patch-Matching

Image and texture synthesis is a challenging task that has long been drawing attention in the fields of image processing, graphics, and machine learning. This problem consists of modelling the desired type of images, either through training examples or via parametric modeling, and then generating images that belong to the same statistical origin. This work addresses the image synthesis task, focusing on two specific families of images -- handwritten digits and face images. This paper offers two main contributions. First, we suggest a simple and intuitive algorithm capable of generating such images in a unified way. The proposed approach is pyramidal, consisting of upscaling and refining the estimated image several times. For each upscaling stage, the algorithm randomly draws small patches from a patch database, and merges these to form a coherent and novel image with high visual quality. The second contribution is a general framework for the evaluation of the generation performance, which combines three aspects: the likelihood, the originality and the spread of the synthesized images. We assess the proposed synthesis scheme and show that the results are similar in nature, and yet different from the ones found in the training set, suggesting that a true synthesis effect has been obtained.

55 published item(s)

Generate Your Talking Avatar from Video Reference

Economics Arena for Large Language Models

Evolved Massive Stars at Low-metallicity VI. Mass-Loss Rate of Red Supergiant Stars in the Large Magellanic Cloud

A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation

Attributable-Watermarking of Speech Generative Models

Better Supervisory Signals by Observing Learning Paths

Configuration-Aware Safe Control for Mobile Robotic Arm with Control Barrier Functions

DA$^2$ Dataset: Toward Dexterity-Aware Dual-Arm Grasping

Dependence of pulsation mode of Cepheids on metallicity

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Dust Extinction Law in Nearby Star-Resolved Galaxies. II. M33 Traced by Supergiants

Expressivity of Emergent Language is a Trade-off between Contextual Complexity and Unpredictability

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning

Improving Item Cold-start Recommendation via Model-agnostic Conditional Variational Autoencoder

Kronecker-factored Quasi-Newton Methods for Deep Learning

Learning the Beauty in Songs: Neural Singing Voice Beautifier

MIC: Model-agnostic Integrated Cross-channel Recommenders

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Revisiting Over-Smoothness in Text to Speech

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

SSR-GNNs: Stroke-based Sketch Representation with Graph Neural Networks

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns

Video-based Facial Micro-Expression Analysis: A Survey of Datasets, Features and Algorithms

Video-Guided Curriculum Learning for Spoken Video Grounding

Dust Extinction Law in Nearby Star-Resolved Galaxies. I. M31 Traced by Supergiants

Edge and sublayer degrees of freedom for phosphorene nanoribbons with twofold-degenerate edge bands via electric field

Practical Quasi-Newton Methods for Training Deep Neural Networks

The Dust Mass of Supernova Remnants in M31

The Sample of Red Supergiants in Twelve Low-Mass Galaxies of the Local Group

A Study of Non-autoregressive Model for Sequence Generation

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Compositional Languages Emerge in a Neural Iterated Learning Model

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Degrees of symmetric Grothendieck polynomials and Castelnuovo-Mumford regularity

Evolved Massive Stars at Low-metallicity II. Red Supergiant Stars in the Small Magellanic Cloud

Evolved Massive Stars at Low-metallicity III. A Source Catalog for the Large Magellanic Cloud

Fully distributed cooperation for networked uncertain mobile manipulators

Information Content of Hierarchical n-Point Polytope Functions for Quantifying and Reconstructing Disordered Systems

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Microstructure Design of Low-Melting-Point Alloy (LMPA)/ Polymer Composites for Dynamic Dry Adhesion Tuning in Soft Gripping

MultiSpeech: Multi-Speaker Text to Speech with Transformer

On Granulation and Irregular Variation of Red Supergiants

PopMAG: Pop Music Accompaniment Generation

Red Supergiants in M31 and M33 I. The Complete Sample

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation

Design and Analysis of Deadline and Budget Constrained Autoscaling (DBCA) Algorithm for 5G Mobile Networks

Design and Analysis of Optimal Threshold Offloading (OTO) Algorithm for LTE Femtocell/Macrocell Networks

Example-Based Image Synthesis via Randomized Patch-Matching