Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
32 works
0 followers
19 topics
4 close collaborators

Published work

32 published item(s)

preprint · 2026 · arXiv

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
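
The reward decomposition and group-relative normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how the atomic answers are aggregated, so averaging them is an assumption, and the function names and the example verifier answers are hypothetical. The normalization follows the usual GRPO recipe of standardizing each reward against its own sampling group.

```python
from statistics import mean, pstdev


def decomposed_reward(answers):
    """Fraction of atomic yes/no checks that passed (assumed aggregation rule)."""
    return sum(answers) / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier answers for four samples generated from one prompt.
group_answers = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
rewards = [decomposed_reward(a) for a in group_answers]
advantages = group_relative_advantages(rewards)
```

Samples that pass more checks than the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to produce a learning signal.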

preprint · 2026 · arXiv

DHI: Leveraging Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models

Large language models (LLMs) frequently produce inaccurate or fabricated information, known as "hallucinations," which compromises their reliability. Existing approaches often train an "Evil LLM" to deliberately generate hallucinations on curated datasets, using these induced hallucinations to guide contrastive decoding against a reliable "positive model" for hallucination mitigation. However, this strategy is limited by the narrow diversity of hallucinations induced, as Evil LLMs trained on specific error types tend to reproduce only these particular patterns, thereby restricting their overall effectiveness. To address these limitations, we propose DHI (Diverse Hallucination Induction), a novel training framework that enables the Evil LLM to generate a broader range of hallucination types without relying on pre-annotated hallucination data. DHI employs a modified loss function that down-weights the generation of specific factually correct tokens, encouraging the Evil LLM to produce diverse hallucinations at targeted positions while maintaining overall factual content. Additionally, we introduce a causal attention masking adaptation to reduce the impact of this penalization on the generation of subsequent tokens. During inference, we apply an adaptive rationality constraint that restricts contrastive decoding to tokens where the positive model exhibits high confidence, thereby avoiding unnecessary penalties on factually correct tokens. Extensive empirical results show that DHI achieves significant performance gains over other contrastive decoding-based approaches across multiple hallucination benchmarks.
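
A small sketch may help make the inference-time step concrete. It shows generic contrastive decoding with a confidence cutoff, under stated assumptions: the abstract does not give the exact form of the adaptive rationality constraint, so the rule used here (keep tokens whose positive-model probability is at least `alpha` times the top probability) is borrowed from standard contrastive decoding, and the function name, `alpha`, and the toy distributions are all hypothetical.

```python
import math


def contrastive_scores(pos_probs, evil_probs, alpha=0.1):
    """Score tokens by log p_pos - log p_evil, restricted to confident tokens."""
    # Assumed cutoff: keep tokens the positive model rates at least
    # alpha times as likely as its top choice.
    threshold = alpha * max(pos_probs.values())
    scores = {}
    for token, p in pos_probs.items():
        if p >= threshold:
            scores[token] = math.log(p) - math.log(evil_probs[token])
    return scores


# Toy next-token distributions (hypothetical values).
pos = {"Paris": 0.55, "Lyon": 0.40, "banana": 0.05}
evil = {"Paris": 0.10, "Lyon": 0.60, "banana": 0.30}

scores = contrastive_scores(pos, evil)
best = max(scores, key=scores.get)
```

In this toy case the token the hallucinating model favors ("Lyon") is penalized, while the implausible token ("banana") is never considered because the positive model assigns it too little probability.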

preprint · 2026 · arXiv

Leveraging Verifier-Based Reinforcement Learning in Image Editing

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

preprint · 2026 · arXiv

Persistent magnitude homology on finite metric space

Magnitude homology is an emerging framework that captures the intrinsic topological and geometric features of metric spaces, demonstrating significant potential for topological data analysis and geometric data analysis. This work introduces persistent magnitude homology, an extension of magnitude homology that captures multi-scale geometric and topological features of metric spaces. We construct the category of finite metric spaces with isometric embeddings and show that magnitude homology defines a functor to the category of abelian groups, naturally leading to the definition of persistent magnitude homology. We also introduce weighted persistent modules and weighted barcodes to offer both an algebraic and visual description of persistent magnitude homology. Additionally, we present an isometry theorem that relates interleaving distances and bottleneck distances, and establish stability results for persistent magnitude homology and magnitude profile. These results establish the stability of magnitude-based descriptors, bridging the gap between theory and practical application.

preprint · 2026 · arXiv

SongSage: A Large Musical Language Model with Lyric Generative Pre-training

Large language models have achieved significant success in various domains, yet their understanding of lyric-centric knowledge has not been fully explored. In this work, we first introduce PlaylistSense, a dataset to evaluate the playlist understanding capability of language models. PlaylistSense encompasses ten types of user queries derived from common real-world perspectives, challenging LLMs to accurately grasp playlist features and address diverse user intents. Comprehensive evaluations indicate that current general-purpose LLMs still have potential for improvement in playlist understanding. Inspired by this, we introduce SongSage, a large musical language model equipped with diverse lyric-centric intelligence through lyric generative pretraining. SongSage undergoes continual pretraining on LyricBank, a carefully curated corpus of 5.48 billion tokens focused on lyrical content, followed by fine-tuning with LyricBank-SFT, a meticulously crafted instruction set comprising 775k samples across nine core lyric-centric tasks. Experimental results demonstrate that SongSage exhibits a strong understanding of lyric-centric knowledge, excels in rewriting user queries for zero-shot playlist recommendations, generates and continues lyrics effectively, and performs proficiently across seven additional capabilities. Beyond its lyric-centric expertise, SongSage also retains general knowledge comprehension and achieves a competitive MMLU score. We will keep the datasets inaccessible due to copyright restrictions and release SongSage and the training script to ensure reproducibility and support music AI research and applications; details of the dataset release plan are provided in the appendix.

preprint · 2025 · arXiv

OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite their success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches have introduced preference learning to exploit human feedback to enhance the video generation. However, these methods primarily adopt the routine in the image domain without an in-depth investigation into video-specific preference optimization. In this paper, we reexamine the design of the video preference learning from two key aspects: feedback source and feedback tuning methodology, and present OnlineVPO, a more efficient preference learning framework tailored specifically for VDMs. On the feedback source, we found that the image-level reward model commonly used in existing methods fails to provide a human-aligned video preference signal due to the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy of humans to provide more modality-aligned feedback for VDMs. Regarding the preference tuning methodology, we introduce an online DPO algorithm tailored for VDMs. It not only enjoys the benefits of superior scalability in optimizing videos with higher resolution and longer duration compared with the existing method, but also mitigates the insufficient optimization issue caused by off-policy learning via online preference generation and curriculum preference update designs. Extensive experiments on the open-source video-diffusion model demonstrate OnlineVPO as a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.

preprint · 2023 · arXiv

A Comprehensive Study on Optimizing Systems with Data Processing Units

New hardware, such as SmartNICs, has been released to offload network applications in data centers. Off-path SmartNICs, a type of multi-core SoC SmartNIC, have attracted the attention of many researchers. Unfortunately, existing work lacks a full exploration of off-path SmartNICs. In this paper, we use a BlueField SmartNIC as an example to conduct a systematic study of the advantages and disadvantages of off-path SmartNICs. We present a detailed performance characterization of an off-path SmartNIC, including computing power and network communication overhead, and offer the following advice: 1) Directly utilize the specific accelerators on the SmartNIC to offload applications; 2) Offload latency-insensitive background processing to the SmartNIC to reduce the load on the host; 3) Regard the SmartNIC as a new endpoint in the network to expand the computing power and storage resources of the server host; 4) Avoid directly employing the design methods of systems based on on-path SmartNICs. We apply this advice to several use cases and show the performance improvements.

preprint · 2023 · arXiv

The stability of persistent homology of hypergraphs

The hypergraph is the most general model for complex networks involving group interactions. Taking the ideas of path homology from Alexander Grigor'yan, Yong Lin, Yuri Muranov and Shing-Tung Yau [18-22], Stephane Bressan, Jingyan Li and the authors of this article introduced embedded homology of hypergraphs [6] in 2019, which led to successful applications to protein-ligand binding networks [24, 25] in 2021. A fundamental question arising from practical applications is about the stability of the persistent embedded homology of hypergraphs. In this paper, we prove the stability of the persistent embedded homology as well as the persistent homology of the associated simplicial complex with respect to perturbations of the filtration on a hypergraph. We apply the persistent homology methods to morphisms of hypergraphs and prove the stability with respect to perturbations of the filtrations. We prove the constancy of the persistent Betti numbers under some conditions on the simple-homotopy types of hypergraphs.

preprint · 2022 · arXiv

Approaching the Fundamental Limit of Orbital Angular Momentum Multiplexing Through a Hologram Metasurface

Establishing and approaching the fundamental limit of orbital angular momentum (OAM) multiplexing are necessary and increasingly urgent for current multiple-input multiple-output research. In this work, we elaborate the fundamental limit in terms of independent scattering channels (or degrees of freedom of scattered fields) through angular-spectral analysis, in conjunction with a rigorous Green function method. The scattering channel limit is universal for arbitrary spatial mode multiplexing, which is launched by a planar electromagnetic device, such as an antenna, metasurface, etc., with a predefined physical size. As a proof of concept, we demonstrate both theoretically and experimentally the limit by a metasurface hologram that transforms orthogonal OAM modes to plane-wave modes scattered at critically separated angular-spectral regions. In particular, a minimax optimization algorithm is applied to suppress angular spectrum aliasing, achieving good performance in both full-wave simulation and experimental measurement at microwave frequencies. This work offers a theoretical upper bound and a corresponding approach route for engineering designs of OAM multiplexing.

preprint · 2022 · arXiv

Maps on random hypergraphs and random simplicial complexes

Let $L$ be a simplicial complex. In this paper, we study random sub-hypergraphs and random sub-complexes of $L$. By considering the minimal complex that a sub-hypergraph can be embedded in and the maximal complex that can be embedded in a sub-hypergraph, we define some maps on the space of probability functions on sub-hypergraphs of $L$. We study the compositions of these maps as well as their actions on the space of probability functions.
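
The two complexes mentioned above can be computed directly for small examples. The sketch below is illustrative only and uses hypothetical function names: it takes the minimal complex containing a hypergraph to be its downward closure, and the maximal complex inside a hypergraph to be the set of hyperedges whose nonempty subsets all belong to the hypergraph.

```python
from itertools import combinations


def nonempty_subsets(edge):
    """All nonempty subsets of a hyperedge, as frozensets."""
    items = sorted(edge)
    return {
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
    }


def minimal_complex(hypergraph):
    """Smallest simplicial complex containing every hyperedge."""
    closure = set()
    for edge in hypergraph:
        closure |= nonempty_subsets(edge)
    return closure


def maximal_subcomplex(hypergraph):
    """Largest simplicial complex contained in the hypergraph."""
    edges = set(hypergraph)
    return {e for e in edges if nonempty_subsets(e) <= edges}


# Example hypergraph on vertices 1, 2, 3; note that {3} is not a hyperedge.
H = {frozenset({1}), frozenset({2}), frozenset({1, 2}), frozenset({2, 3})}
upper = minimal_complex(H)     # adds the missing vertex {3}
lower = maximal_subcomplex(H)  # drops {2, 3}, whose face {3} is absent
```

When the hypergraph is already a simplicial complex, both functions return it unchanged.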

preprint · 2022 · arXiv

Multi-Granularity Distillation Scheme Towards Lightweight Semi-Supervised Semantic Segmentation

Despite varying degrees of progress in the field of Semi-Supervised Semantic Segmentation (SSSS), most of its recent successes rely on unwieldy models, and lightweight solutions remain unexplored. We find that existing knowledge distillation techniques pay more attention to pixel-level concepts from labeled data and fail to take the more informative cues within unlabeled data into account. Consequently, we offer the first attempt to provide lightweight SSSS models via a novel multi-granularity distillation (MGD) scheme, where multi-granularity is captured from three aspects: i) complementary teacher structure; ii) labeled-unlabeled data cooperative distillation; iii) hierarchical and multi-level loss setting. Specifically, MGD is formulated as a labeled-unlabeled data cooperative distillation scheme, which helps to take full advantage of diverse data characteristics that are essential in the semi-supervised setting. Image-level semantic-sensitive loss, region-level content-aware loss, and pixel-level consistency loss are set up to enrich hierarchical distillation abstraction via structurally complementary teachers. Experimental results on PASCAL VOC2012 and Cityscapes reveal that MGD can outperform the competitive approaches by a large margin under diverse partition protocols. For example, the performance of the ResNet-18 and MobileNet-v2 backbones is boosted by 11.5% and 4.6% respectively under the 1/16 partition protocol on Cityscapes. Although the FLOPs of the model backbone are compressed by 3.4-5.3x (ResNet-18) and 38.7-59.6x (MobileNet-v2), the model manages to achieve satisfactory segmentation results.

preprint · 2022 · arXiv

On Modular Cohomotopy Groups

Let $p$ be a prime and let $π^n(X;\mathbb{Z}/p^r)=[X,M_n(\mathbb{Z}/p^r)]$ be the set of homotopy classes of based maps from CW-complexes $X$ into the mod $p^r$ Moore spaces $M_n(\mathbb{Z}/p^r)$ of degree $n$, where $\mathbb{Z}/p^r$ denotes the integers mod $p^r$. In this paper we firstly determine the modular cohomotopy groups $π^n(X;\mathbb{Z}/p^r)$ up to extensions by classical methods of primary cohomology operations and give conditions for the splitness of the extensions. Secondly we utilize some unstable homotopy theory of Moore spaces to study the modular cohomotopy groups; especially, the group $π^3(X;\mathbb{Z}_{(2)})$ with $\dim(X)\leq 6$ is determined.

preprint · 2022 · arXiv

On the Cayley-persistence algebra

In this paper, we introduce a persistent (co)homology theory for Cayley digraph grading. We give the algebraic structures of Cayley-persistence object. Specifically, we consider the module structure of persistent (co)homology and show the decomposition of a finitely generated Cayley-persistence module. Moreover, we introduce the persistence-cup product on the Cayley-persistence module and study the twisted structure with respect to the persistence-cup product. As an application on manifolds, we show that the persistent (co)homology is closely related to the persistent map of fundamental classes.

preprint · 2022 · arXiv

Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin songs performed by a female professional singer. Audio files are recorded with studio quality at a sampling rate of 44,100 Hz and the corresponding lyrics and musical scores are provided. All singing recordings have been phonetically annotated with phoneme boundaries and syllable (note) boundaries. To demonstrate the reliability of the released data and to provide a baseline for future research, we built baseline deep neural network-based SVS models and evaluated them with both objective metrics and subjective mean opinion score (MOS) measure. Experimental results show that the best SVS model trained on our database achieves 3.70 MOS, indicating the reliability of the provided corpus. Opencpop is released to the open-source community WeNet, and the corpus, as well as synthesized demos, can be found on the project homepage.

preprint · 2022 · arXiv

Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation

Recently, synthetic data-based Instance Segmentation has become an exceedingly favorable optimization paradigm since it leverages simulation rendering and physics to generate high-quality image-annotation pairs. In this paper, we propose a Parallel Pre-trained Transformers (PPT) framework to accomplish the synthetic data-based Instance Segmentation task. Specifically, we leverage off-the-shelf pre-trained vision Transformers to alleviate the gap between natural and synthetic data, which helps to provide good generalization in the downstream synthetic data scene with few samples. Swin-B-based CBNet V2, Swin-L-based CBNet V2 and Swin-L-based Uniformer are employed for parallel feature learning, and the results of these three models are fused by a pixel-level Non-maximum Suppression (NMS) algorithm to obtain more robust results. The experimental results reveal that PPT ranks first in the CVPR2022 AVA Accessibility Vision and Autonomy Challenge, with a 65.155% mAP.

preprint · 2022 · arXiv

ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer

The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrices while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance in general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.

preprint · 2022 · arXiv

Soft-CP: A Credible and Effective Data Augmentation for Semantic Segmentation of Medical Lesions

Medical datasets usually suffer from scarcity and data imbalance. Moreover, annotating large datasets for semantic segmentation of medical lesions is time-consuming and requires domain knowledge. In this paper, we propose a new object-blend method (Soft-CP for short) that incorporates the Copy-Paste augmentation method for semantic segmentation of medical lesions offline, ensuring correct edge information around the lesion, to address the issues mentioned above. We validated the method on several datasets in different imaging modalities. In our experiments on the KiTS19 [2] dataset, Soft-CP outperforms existing medical lesion synthesis approaches. The Soft-CP augmentation provides gains of +26.5% DSC in the low-data regime (10% of data) and +10.2% DSC in the high-data regime (all of the data). In the offline training data, the ratio of real images to synthetic images is 3:1.

preprint · 2022 · arXiv

TRT-ViT: TensorRT-oriented Vision Transformer

We revisit the existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNet series and deviate from the realistic deployment scenario. This may be because the current criteria for measuring computational efficiency, such as FLOPs or parameter counts, are one-sided, sub-optimal, and hardware-insensitive. Thus, this paper directly treats the TensorRT latency on the specific hardware as an efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at stage-level, early Transformer and late CNN at block-level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers with respect to the latency/accuracy trade-off across diverse visual tasks, e.g., image classification, object detection and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin and 2.0$\times$ faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves comparable performance with Twins, while the inference speed is increased by 2.8$\times$.

preprint · 2021 · arXiv

Emergence of high-temperature superconductivity at the interface of two Mott insulators

Interfacial superconductivity has manifested itself in various types of heterostructures: band insulator-band insulator, band insulator-Mott insulator, and Mott insulator-metal. We report the observation of high-temperature superconductivity (HTS) in a complementary and long-expected type of heterostructure, which consists of two Mott insulators, La2CuO4 (LCO) and PrBa2Cu3O7 (PBCO). By carefully controlling the oxidization conditions and selectively doping CuO2 planes with Fe atoms, which suppress superconductivity, we found that the superconductivity arises at the LCO side and is confined within no more than two unit cells (about 2.6 nm) near the interface. The Fe barrier is overcome if excess oxygen is present during growth. Some possible mechanisms for the interfacial HTS are discussed, and we attribute it to the redistribution of oxygen.

preprint · 2020 · arXiv

A Discrete Morse Theory for Hypergraphs

A hypergraph can be obtained from a simplicial complex by deleting some non-maximal simplices. By [11], a hypergraph gives an associated simplicial complex. By [4], the embedded homology of a hypergraph is the homology of the infimum chain complex, or equivalently, the homology of the supremum chain complex. In this paper, we generalize the discrete Morse theory for simplicial complexes by R. Forman [5-7] and give a discrete Morse theory for hypergraphs. We use the critical simplices of the associated simplicial complex to construct a sub-chain complex of the infimum chain complex and a sub-chain complex of the supremum chain complex, then prove that the embedded homology of a hypergraph is isomorphic to the homology of the constructed chain complexes. Moreover, we define discrete Morse functions on hypergraphs and compute the embedded homology in terms of the critical hyperedges. As by-products, we derive some Morse inequalities and collapse results for hypergraphs.

preprint · 2020 · arXiv

Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer

This paper presents a high-quality singing synthesizer that is able to model a voice with limited available recordings. Based on the sequence-to-sequence singing model, we design a multi-singer framework to leverage all the existing singing data of different singers. To attenuate the issue of musical score imbalance among singers, we incorporate an adversarial task of singer classification to make the encoder output less singer-dependent. Furthermore, we apply multiple random window discriminators (MRWDs) on the generated acoustic features to turn the network into a GAN. Both objective and subjective evaluations indicate that the proposed synthesizer can generate a higher-quality singing voice than the baseline (4.12 vs 3.53 in MOS). In particular, the articulation of high-pitched vowels is significantly enhanced.

preprint · 2020 · arXiv

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions that consist of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Figure 1). This is primarily due to (i) the conservative characteristic of traditional training objectives that drives the model to generate correct but hardly discriminative captions for similar images and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective that is formulated on top of a reference model to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From the local perspective, a local discriminative constraint is proposed to increase attention such that it emphasizes the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves competitive performance over existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.

preprint · 2020 · arXiv

Improved VIV response prediction using adaptive parameters and data clustering

Slender marine structures such as deep-water riser systems are continuously exposed to currents leading to vortex-induced vibrations (VIV) of the structure. This may result in amplified drag loads and fast accumulation of fatigue damage. Consequently, accurate prediction of VIV responses is of great importance for the safe design and operation of marine risers. Model tests with elastic pipes have shown that VIV responses are influenced by many structural and hydrodynamic parameters, which have not been fully modelled in present frequency domain VIV prediction tools. Traditionally, predictions have been computed using a single set of hydrodynamic parameters, often leading to inconsistent prediction accuracy when compared with observed field measurements and experimental data. Hence, it is necessary to implement a high safety factor of 10-20 in the riser design, which increases development cost and adds extra constraints in the field operation. One way to compensate for the simplifications in the mathematical prediction model is to apply adaptive parameters to describe different riser responses. The objective of this work is to demonstrate a new method to improve the prediction consistency and accuracy by applying adaptive hydrodynamic parameters. In the present work, a four-step approach has been proposed: First, the measured VIV response will be analysed to identify key parameters to represent the response characteristics. These parameters will be grouped using data clustering algorithms. Secondly, optimal hydrodynamic parameters will be identified for each data group by optimisation against measured data. Thirdly, the VIV response using the obtained parameters will be calculated and the prediction accuracy evaluated. Finally, the correct hydrodynamic parameters to be used for new cases can be obtained from the clustering. This concept has been demonstrated with examples from experimental data.

preprint · 2020 · arXiv

Lifting theorem for the virtual pure braid groups

In this article we prove a lifting theorem for the virtual pure braid groups. This theorem says that if we know a presentation of the virtual pure braid group $VP_4$, then we can find a presentation of $VP_n$ for arbitrary $n > 4$. Using this theorem, we find a set of generators and defining relations for the simplicial group $T_*$, which was defined in the previous article of the authors. We find a decomposition of the Artin pure braid group $P_n$ as a semi-direct product of free groups in the cabled generators.

preprint · 2020 · arXiv

Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task facilitating cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which only has access to coarse video-level language description annotations without temporal boundaries and is more consistent with reality, as such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, we offer the first attempt to extend RL to the temporal localization task with weak supervision. As it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and contributes to acquiring more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised method and even beats some competitive fully-supervised ones.

preprint2020arXiv

Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video

Temporally language grounding in untrimmed videos is a newly raised task in video understanding. Most existing methods suffer from poor efficiency, lack interpretability, and deviate from the human perception mechanism. Inspired by the human coarse-to-fine decision-making paradigm, we formulate a novel Tree-Structured Policy based Progressive Reinforcement Learning (TSP-PRL) framework that sequentially regulates the temporal boundary through an iterative refinement process. The semantic concepts are explicitly represented as the branches of the policy, which helps decompose complex policies efficiently into interpretable primitive actions. Progressive reinforcement learning provides correct credit assignment via two task-oriented rewards that encourage mutual promotion within the tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and ActivityNet datasets, and the experimental results show that TSP-PRL achieves performance competitive with existing state-of-the-art methods.

preprint2020arXiv

Weighted Fundamental Group

In this paper, we develop and study the theory of weighted fundamental groups of weighted simplicial complexes. When all weights are 1, the weighted fundamental group reduces to the usual fundamental group as a special case. We also study weighted versions of classical theorems such as van Kampen's theorem. In addition, we investigate the abelianization, lower central series and applications of weighted fundamental groups.

preprint2020arXiv

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate a syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms the baseline system of convolutional neural networks by 1.44 MOS on sound quality, 1.18 on pronunciation accuracy and 1.38 on naturalness, respectively. In two A/B tests, the proposed F0 and duration modeling methods achieve preference rates of 97.3% and 84.3% over the baseline, respectively, which demonstrates the overwhelming advantages of XiaoiceSing.

preprint2019arXiv

Discrete Morse Theory for Weighted Simplicial Complexes

In this paper, we study Forman's discrete Morse theory in the context of weighted homology. We develop weighted versions of classical theorems in discrete Morse theory. A key difference in the weighted case is that simplicial collapses do not necessarily preserve weighted homology. We work out some sufficient conditions for collapses to preserve weighted homology, and we study the effect of elementary removals on weighted homology. An application to sequence analysis is included, where we study the weighted ordered complexes of sequences.

preprint2019arXiv

PMC-GANs: Generating Multi-Scale High-Quality Pedestrian with Multimodal Cascaded GANs

Recently, generative adversarial networks (GANs) have shown great advantages in synthesizing images, leading to a surge of work on using fake images to augment data. This paper proposes multimodal cascaded generative adversarial networks (PMC-GANs) to generate realistic and diversified pedestrian images and to augment pedestrian detection data. The generator of our model applies a residual U-net structure, with multi-scale residual blocks to encode features and attention residual blocks to help decode and rebuild pedestrian images. The model is constructed in a coarse-to-fine fashion and adopts a cascade structure, which helps produce high-resolution pedestrians. PMC-GANs outperform the baselines and, when used for data augmentation, improve pedestrian detection results.