Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
47 works
0 followers
33 topics
4 close collaborators

Published work

47 published item(s)

preprint 2026 arXiv

A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

preprint 2026 arXiv

Beyond Known Fakes: Generalized Detection of AI-Generated Images via Post-hoc Distribution Alignment

The rapid proliferation of highly realistic AI-generated images poses serious security threats such as misinformation and identity fraud. Detecting generated images in open-world settings is particularly challenging when they originate from unknown generators, as existing methods typically rely on model-specific artifacts and require retraining on new fake data, limiting their generalization and scalability. In this work, we propose Post-hoc Distribution Alignment (PDA), a generalized and model-agnostic framework for detecting AI-generated images under unknown generative threats. Specifically, PDA reformulates detection as a distribution alignment task by regenerating test images through a known generative model. When real images are regenerated, they inherit model-specific artifacts and align with the known fake distribution. In contrast, regenerated unknown fakes contain incompatible or mixed artifacts and remain misaligned. This difference allows an existing detector, trained on the known generative model, to accurately distinguish real images from unknown fakes without requiring access to unseen data or retraining. Extensive experiments across 16 state-of-the-art generative models, including GANs, diffusion models, and commercial text-to-image APIs (e.g., Midjourney), demonstrate that PDA achieves average detection accuracy of 96.69%, outperforming the best baseline by 10.71%. Comprehensive ablation studies and robustness analyses further confirm PDA's generalizability and resilience to distribution shifts and image transformations. Overall, our work provides a practical and scalable solution for real-world AI-generated image detection where new generative models emerge continuously.

preprint 2026 arXiv

Field-induced magnetic phase transitions and transport anomalies in GdAlSi

Magnetic topological materials hosting non-zero Berry curvature have emerged as a focus of intensive research due to their exceptional magnetoelectric coupling phenomena and potential applications in next-generation spintronic devices. In this work, we successfully synthesized high-quality GdAlSi single crystals, a prototypical member of the RAlX (R = rare earth elements; X = Si/Ge) family that has been theoretically predicted to sustain a non-trivial Weyl semimetal state. Through systematic investigations of magnetic and transport properties, we identified two successive antiferromagnetic transitions at critical temperatures TN1 ≈ 31.9 K and TN2 ≈ 31.1 K, as evidenced by temperature-dependent resistivity, magnetic susceptibility, and specific heat measurements. Notably, applied magnetic fields exceeding 8 T induce a third magnetic transition (TN3), generating a cascade of metamagnetic transitions that collectively form a dendritic phase diagram. This complex magnetic behavior is attributed to the interplay between localized Gd-4f moments and itinerant conduction electrons, possibly mediated by Dzyaloshinskii-Moriya interactions. Transport measurements revealed striking stepwise anomalies in magnetoresistance when crossing phase boundaries, accompanied by pronounced hysteresis loops arising from magnetic moment flopping processes. Our results not only establish GdAlSi as a rich platform for investigating correlated topological states, but also demonstrate its potential for engineering topological phase transitions through magnetic symmetry manipulation in Weyl semimetals.

preprint 2026 arXiv

Pressure-Free Surface-Induced Flow by Geometric Rectification

Pressure-driven flow collapses when confined ($u\propto r^{2}$). Asymmetry rectifies surface activity (exchange or slip gradients) into axial flux at $ΔP=0$ despite zero net exchange. Lorentz reciprocity yields a projection law: throughput is the inner product of source with a geometry kernel. Signatures include inverted "narrower-is-faster" scaling ($u\propto r^{-1}$), leading-order viscosity independence, length amplification ($Q\propto L$), and linear superposition, defining surface-induced flow as a pressure-free Stokes-transport mode from microfluidics to physiology.

preprint 2026 arXiv

VidLeaks: Membership Inference Attacks Against Text-to-Video Models

The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.

preprint 2025 arXiv

Anomalous Hall effect and rich magnetic phase diagram of Mn$_{100-x}$Rh$_{x}$ epitaxial films

A series of Mn$_{100-x}$Rh$_x$ ($20 \le x \le 50$) thin films were epitaxially grown on the MgO substrate using magnetron sputtering technique, and were systematically investigated by magnetization, longitudinal electrical resistivity, and transverse Hall resistivity. After optimizing the growth conditions, phase-pure Mn$_{100-x}$Rh$_x$ films with a cubic CsCl-type structure were obtained, and their magnetic phase diagram was built. The manipulation of Rh content leads to a rich magnetic phase diagram, where three different regimes can be identified: for $x < 40$, Mn$_{100-x}$Rh$_x$ films undergo a ferromagnetic (FM) transition below $T_\mathrm{C} \approx$ 330-350 K; for $40 \le x \le 45$, in addition to the FM transition at $T_\mathrm{C} \approx$ 200 K, Mn$_{100-x}$Rh$_x$ films undergo a FM-to-antiferromagnetic (AFM) transition at $T_\mathrm{N} \approx$ 120 K; finally for $x > 45$, only one AFM transition at $T_\mathrm{N} \approx$ 150 K can be tracked. All the Mn$_{100-x}$Rh$_x$ films exhibit distinct anomalous Hall effect in their magnetically ordered state, which is most likely due to the intrinsic Berry-curvature mechanism. In addition, all the anomalous Hall transport properties, including the resistivity, conductivity, and angle exhibit a strong correlation with the magnetic properties of Mn$_{100-x}$Rh$_x$ films, which become most evident for $x$ = 35. Our systematic investigations suggest a strong correlation between magnetic properties and electronic band topology in Mn$_{100-x}$Rh$_x$ films, highlighting their great potential for AFM spintronics.

preprint 2025 arXiv

HY-MT1.5 Technical Report

In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

preprint 2025 arXiv

Kinetic Catalysis of Spontaneous Knotting: How Free Particles Modulate Filament Entanglement

Entangled knots form spontaneously in flexible filaments, yet the influence of the surrounding environment on this process is poorly understood. Here we demonstrate that free-moving particles act as kinetic catalysts for spontaneous knotting. Through controlled agitation experiments, we find that a small number of inert beads substantially enhance the probability and accelerate the rate of knot formation. This catalytic effect is non-monotonic: there is an optimal particle size and concentration that maximizes entanglement, while an excess of particles suppresses knotting by impeding the filament's dynamics. We develop a stochastic model that quantitatively reproduces this behavior, attributing it to a competition between entanglement-promoting collisions and motion-suppressing drag. Our findings reveal a mechanism for tuning topological complexity, whereby adjusting these environmental agitators can either promote rapid self-assembly or inhibit unwanted entanglement. This work suggests new strategies for controlling filament topology in settings ranging from crowded biological environments to advanced materials processing.

preprint 2024 arXiv

Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review

Metaphor as an advanced cognitive modality works by extracting familiar concepts in the target domain in order to understand vague and abstract concepts in the source domain. This helps humans to quickly understand and master new domains and thus adapt to changing environments. With the continuous development of metaphor research in the natural language community, many studies using knowledge-assisted models to detect textual metaphors have emerged in recent years. Compared to not using knowledge, systems that introduce various kinds of knowledge achieve greater performance gains and reach SOTA in a recent study. Based on this, the goal of this paper is to provide a comprehensive review of research advances in the application of deep learning for knowledge injection in metaphor detection tasks. We will first systematically summarize and generalize the mainstream knowledge and knowledge injection principles. Then, the datasets, evaluation metrics, and benchmark models used in metaphor detection tasks are examined. Finally, we explore the current issues facing knowledge injection methods and provide an outlook on future research directions.

preprint 2023 arXiv

Backdoor Attacks Against Dataset Distillation

Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.

preprint 2023 arXiv

DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the "person" topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of the counterpart solutions, like ours, against the rapidly-evolving fake image generation.

preprint 2023 arXiv

Ultrafast X-ray Diffraction Probe of Coherent Spin-state Dynamics in Molecules

We propose an approach to probe coherent spin-state dynamics of molecules using circularly polarized hard x-ray pulses. For the dynamically aligned nitric oxide molecules in a coherent superposition spin-orbit coupled electronic state that can be prepared through stimulated Raman scattering, we demonstrate the capability of ultrafast x-ray diffraction to not only reveal the quantum beating of the coherent spin-state wave packet, but also image the spatial spin density of the molecule. With circularly polarized ultrafast x-ray diffraction signal, we show that the electronic density matrix can be retrieved. The spatio-temporal resolving power of ultrafast x-ray diffraction paves the way for tracking transient spatial wave function in molecular dynamics involving spin degree of freedom.

preprint 2022 arXiv

Accelerating Attention through Gradient-Based Learned Runtime Pruning

Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9x and 3.9x speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (<0.2% degradation).
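The runtime pruning idea in this abstract can be illustrated with a minimal sketch. It assumes a fixed, hand-picked threshold, whereas the paper learns the threshold jointly with the weights through a regularizer; the function name `pruned_attention_weights` is hypothetical and the bit-serial hardware is not modeled.

```python
import math

def pruned_attention_weights(scores, threshold):
    """Softmax over attention scores, skipping every score below `threshold`.

    Assumption: `threshold` is a constant chosen by hand for illustration.
    """
    # Keep only the positions whose score reaches the threshold.
    kept = [(i, s) for i, s in enumerate(scores) if s >= threshold]
    weights = [0.0] * len(scores)
    if not kept:
        return weights
    # Subtract the peak score before exponentiating for numerical stability.
    peak = max(s for _, s in kept)
    exps = [(i, math.exp(s - peak)) for i, s in kept]
    total = sum(e for _, e in exps)
    for i, e in exps:
        weights[i] = e / total
    return weights

weights = pruned_attention_weights([4.0, -3.0, 3.5, -5.0], threshold=0.0)
```

Pruned positions receive a weight of exactly zero, so the downstream multiplication with the value vectors can be skipped for them.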

preprint 2022 arXiv

Anomalous thermal Hall effect and anomalous Nernst effect of CsV$_{3}$Sb$_{5}$

Motivated by time-reversal symmetry breaking and the giant anomalous Hall effect in the kagome superconductors AV$_3$Sb$_5$ (A = Cs, K, Rb), we carried out thermal transport measurements on CsV$_3$Sb$_5$. In addition to the anomalous Hall effect, the anomalous Nernst effect and the anomalous thermal Hall effect emerge. Interestingly, the longitudinal thermal conductivity $κ_{xx}$ largely deviates from the electronic contribution obtained from the longitudinal conductivity $σ_{xx}$ by the Wiedemann-Franz law. In contrast, the thermal Hall conductivity $κ_{xy}$ is roughly consistent with the Wiedemann-Franz law from electronic contribution. All these results indicate a large phonon contribution to the longitudinal thermal conductivity. Moreover, the thermal Hall conductivity is also slightly greater than the theoretical electronic contribution, indicating other charge-neutral contributions. In addition, the Nernst coefficient and Hall resistivity show multi-band behavior with a possible additional contribution from Berry curvature at low fields.
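For reference, the Wiedemann-Franz comparison described in this abstract can be written out as follows. This is the standard textbook form with the Sommerfeld value of the Lorenz number $L_0$; the superscript "el" marking the electronic estimate is notation introduced here, not taken from the paper.

```latex
\kappa^{\mathrm{el}}_{xx} = L_0\, T\, \sigma_{xx}, \qquad
\kappa^{\mathrm{el}}_{xy} = L_0\, T\, \sigma_{xy}, \qquad
L_0 = \frac{\pi^{2} k_{B}^{2}}{3 e^{2}} \approx 2.44 \times 10^{-8}\ \mathrm{W\,\Omega\,K^{-2}}
```

A measured $κ_{xx}$ well above $L_0 T σ_{xx}$ points to heat carried by charge-neutral excitations, which the abstract attributes mainly to phonons.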

preprint 2022 arXiv

Auditing Membership Leakages of Multi-Exit Networks

Relying on the fact that not all inputs require the same amount of computation to yield a confident prediction, multi-exit networks are gaining attention as a prominent approach for pushing the limits of efficient deployment. Multi-exit networks endow a backbone model with early exits, allowing predictions to be obtained at intermediate layers of the model and thus saving computation time and/or energy. However, current designs of multi-exit networks consider only the trade-off between resource usage efficiency and prediction accuracy; the privacy risks stemming from them have never been explored. This prompts the need for a comprehensive investigation of privacy risks in multi-exit networks. In this paper, we perform the first privacy analysis of multi-exit networks through the lens of membership leakages. In particular, we first leverage the existing attack methodologies to quantify the multi-exit networks' vulnerability to membership leakages. Our experimental results show that multi-exit networks are less vulnerable to membership leakages and the exit (number and depth) attached to the backbone model is highly correlated with the attack performance. Furthermore, we propose a hybrid attack that exploits the exit information to improve the performance of existing attacks. We evaluate membership leakage threat caused by our hybrid attack under three different adversarial setups, ultimately arriving at a model-free and data-free adversary. These results clearly demonstrate that our hybrid attacks are very broadly applicable, thereby the corresponding risks are much more severe than shown by existing membership inference attacks. We further present a defense mechanism called TimeGuard specifically for multi-exit networks and show that TimeGuard mitigates the newly proposed attacks perfectly.

preprint 2022 arXiv

Condensing Graphs via One-Step Gradient Matching

As training deep learning models on large datasets takes a lot of time and resources, it is desired to construct a small synthetic dataset with which we can train deep learning models sufficiently. There are recent works that have explored solutions on condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large-real data and small-synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have their inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance and our method is significantly faster than multi-step gradient matching (e.g. 15x in CIFAR10 for synthesizing 500 graphs). Code is available at https://github.com/amazon-research/DosCond.

preprint 2022 arXiv

DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization

Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. However, such models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. To alleviate this issue, we propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model. Empirical analyses show that, despite the challenging nature of generative tasks, we were able to achieve a 16.5x model footprint compression ratio with little performance drop relative to the full-precision counterparts on multiple summarization and QA datasets. We further pushed the limit of compression ratio to 27.7x and presented the performance-efficiency trade-off for generative tasks using pre-trained models. To the best of our knowledge, this is the first work aiming to effectively distill and quantize sequence-to-sequence pre-trained models for language generation tasks.

preprint 2022 arXiv

ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions

Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
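The three steps named in this abstract (per-dimension empirical CDF, per-dimension tail probabilities, aggregation across dimensions) can be sketched in a few lines. This is a simplified illustration, not the authors' released implementation: the aggregation rule (sum of negative log tail probabilities, then the larger of the left-tail and right-tail totals) is an assumption, and the function names are hypothetical.

```python
import math

def ecdf_tail_probs(column):
    """Empirical left-tail and right-tail probabilities for every value in one dimension."""
    n = len(column)
    left = [sum(1 for v in column if v <= x) / n for x in column]
    right = [sum(1 for v in column if v >= x) / n for x in column]
    return left, right

def ecod_style_scores(data):
    """Simplified ECOD-style outlier score for each row of `data` (a list of equal-length rows)."""
    n, d = len(data), len(data[0])
    left_sum = [0.0] * n
    right_sum = [0.0] * n
    for j in range(d):
        column = [row[j] for row in data]
        left, right = ecdf_tail_probs(column)
        for i in range(n):
            # A small tail probability contributes a large positive term.
            left_sum[i] += -math.log(left[i])
            right_sum[i] += -math.log(right[i])
    # Assumed aggregation: take the more extreme of the two tails.
    return [max(l, r) for l, r in zip(left_sum, right_sum)]

points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [-0.1, 0.05], [5.0, 6.0]]
scores = ecod_style_scores(points)
most_outlying = scores.index(max(scores))
```

Because the score depends only on ranks within each dimension, no hyperparameters are needed, which matches the parameter-free claim in the abstract.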

preprint 2022 arXiv

Indexing Metric Spaces for Exact Similarity Search

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern the variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data have been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.
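The role of the triangle inequality in pruning can be shown with a minimal single-pivot range query. This is a generic sketch of pivot-based filtering, not any specific index from the survey; `range_query`, `abs_diff`, and the toy one-dimensional data are all hypothetical.

```python
def range_query(objects, pivot_dists, pivot, query, radius, dist):
    """Return the objects within `radius` of `query`, plus how many distances were computed.

    `pivot_dists[i]` is the precomputed distance from objects[i] to `pivot`.
    """
    d_qp = dist(query, pivot)
    results, computed = [], 0
    for obj, d_op in zip(objects, pivot_dists):
        # Triangle inequality gives the lower bound d(q, o) >= |d(q, p) - d(o, p)|.
        # If the bound already exceeds the radius, the object cannot qualify.
        if abs(d_qp - d_op) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed

def abs_diff(a, b):
    """Toy metric on real numbers."""
    return abs(a - b)

objects = [1.0, 2.0, 8.0, 9.0, 15.0]
pivot = 0.0
pivot_dists = [abs_diff(o, pivot) for o in objects]
hits, computed = range_query(objects, pivot_dists, pivot, query=8.5, radius=1.0, dist=abs_diff)
```

Only the candidates that survive the lower-bound test need an actual distance computation, which is where the savings come from when the metric is expensive.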

preprint 2022 arXiv

Just Enough, Just in Time, Just for "Me": Fundamental Principles for Engineering IoT-native Software Systems

By seamlessly integrating everyday objects and by changing the way we interact with our surroundings, Internet of Things (IoT) is drastically improving the life quality of households and enhancing the productivity of businesses. Given the unique IoT characteristics, IoT applications have emerged distinctively from the mainstream application types. Inspired by the outlook of a programmable world, we further foresee an IoT-native trend in designing, developing, deploying, and maintaining software systems. However, although the challenges of IoT software projects are frequently discussed, addressing those challenges is still in the "crossing the chasm" period. By participating in various IoT projects, we gradually distilled three fundamental principles for engineering IoT-native software systems, namely just enough, just in time, and just for "me". These principles target the challenges that are associated with the most typical features of IoT environments, ranging from resource limits to technology heterogeneity of IoT devices. We expect this research to trigger dedicated efforts, techniques and theories for the topic of IoT-native software engineering.

preprint 2022 arXiv

Membership-Doctor: Comprehensive Assessment of Membership Inference Against Machine Learning Models

Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to infer whether an input sample was used to train the model. Over the past few years, researchers have produced many membership inference attacks and defenses. However, these attacks and defenses employ a variety of strategies and are conducted on different models and datasets. The lack of a comprehensive benchmark means we do not understand the strengths and weaknesses of existing attacks and defenses. We fill this gap by presenting a large-scale measurement of different membership inference attacks and defenses. We systematize membership inference through the study of nine attacks and six defenses and measure the performance of different attacks and defenses in the holistic evaluation. We then quantify the impact of the threat model on the results of these attacks. We find that some assumptions of the threat model, such as same-architecture and same-distribution between shadow and target models, are unnecessary. We are also the first to execute attacks on the real-world data collected from the Internet, instead of laboratory datasets. We further investigate what determines the performance of membership inference attacks and reveal that the commonly believed overfitting level is not sufficient for the success of the attacks. Instead, the Jensen-Shannon distance of entropy/cross-entropy between member and non-member samples correlates with attack performance much better. This gives us a new way to accurately predict membership inference risks without running the attack. Finally, we find that data augmentation degrades the performance of existing attacks to a larger extent, and we propose an adaptive attack using augmentation to train shadow and attack models that improve attack performance.
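The Jensen-Shannon measurement mentioned in this abstract can be sketched as below. The sketch assumes prediction entropy is the per-sample statistic and that the two groups are compared through fixed-width histograms; the binning scheme, the base-2 logarithm, the function names, and the toy probability vectors are illustrative assumptions rather than the paper's procedure.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def histogram(values, bins, lo, hi):
    """Normalized fixed-width histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence, in [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

# Toy model outputs: confident on members, uncertain on non-members (made-up numbers).
member_outputs = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
non_member_outputs = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]

max_entropy = math.log(3)  # upper bound on entropy for three classes
member_hist = histogram([prediction_entropy(o) for o in member_outputs], 4, 0.0, max_entropy)
non_member_hist = histogram([prediction_entropy(o) for o in non_member_outputs], 4, 0.0, max_entropy)
gap = js_distance(member_hist, non_member_hist)
```

A larger `gap` means the two entropy distributions are easier to tell apart, which is the property the abstract reports as correlating with attack performance.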

preprint2022arXiv

Multilingual Knowledge Graph Completion with Self-Supervised Adaptive Graph Alignment

Predicting missing facts in a knowledge graph (KG) is crucial as modern KGs are far from complete. Due to labor-intensive human labeling, this phenomenon deteriorates when handling knowledge represented in various languages. In this paper, we explore multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages. However, language alignment used in prior works is still not fully exploited: (1) alignment pairs are treated equally to maximally push parallel entities to be close, which ignores KG capacity inconsistency; (2) seed alignment is scarce and new alignment identification is usually in a noisily unsupervised manner. To tackle these issues, we propose a novel self-supervised adaptive graph alignment (SS-AGA) method. Specifically, SS-AGA fuses all KGs as a whole graph by regarding alignment as a new edge type. As such, information propagation and noise influence across KGs can be adaptively controlled via relation-aware attention weights. Meanwhile, SS-AGA features a new pair generator that dynamically captures potential alignment pairs in a self-supervised paradigm. Extensive experiments on both the public multilingual DBPedia KG and newly-created industrial multilingual E-commerce KG empirically demonstrate the effectiveness of SS-AG

preprint2022arXiv

Neutrino Rocket Jet Model: An Explanation of High-velocity Pulsars and their Spin-down Evolution

The fact that the spatial velocity of pulsars is generally higher than that of their progenitor stars has bothered astronomers for nearly 50 years. It has been extensively argued that the high pulsar velocity should be acquired during a natal kick process on a timescale of 100ms - 10s in the supernova explosion, in which some asymmetrical dynamical mechanism plays a key role. However, a satisfactory picture generally is still lacking. In this study, it is argued that the neutrino rocket model can well account for the high speed as well as the long-term evolution behaviors of pulsars. The neutrinos are emitted from superfluid vortex neutrons through the neutrino cyclotron radiation mechanism. The unique characters of left-handed neutrinos and right-handed antineutrinos resulting from the nonconservation of parity in weak interactions play a major role in the spatial asymmetry. The continuous acceleration of pulsars can be naturally explained by this model, which yields a maximum velocity surpassing 1000 km s$^{-1}$. The alignment between the spinning axis and the direction of motion observed for the Crab pulsar (PSR 0531) and the Vela pulsar (PSR 0833) can be well accounted for. The observed correlation between the spin-down rate and the period of long-period pulsars with $P \gtrsim 0.5$s can also be satisfactorily explained.

preprint2022arXiv

Online Knowledge Distillation for Efficient Pose Estimation

Existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. One promising technique to obtain an accurate yet lightweight pose estimator is knowledge distillation, which distills the pose knowledge from a powerful teacher model to a less-parameterized student model. However, existing pose distillation works rely on a heavy pre-trained estimator to perform knowledge transfer and require a complex two-stage learning procedure. In this work, we investigate a novel Online Knowledge Distillation framework by distilling Human Pose structure knowledge in a one-stage manner to guarantee the distillation efficiency, termed OKDHP. Specifically, OKDHP trains a single multi-branch network and acquires the predicted heatmaps from each, which are then assembled by a Feature Aggregation Unit (FAU) as the target heatmaps to teach each branch in reverse. Instead of simply averaging the heatmaps, FAU, which consists of multiple parallel transformations with different receptive fields, leverages the multi-scale information and thus obtains target heatmaps with higher quality. Specifically, the pixel-wise Kullback-Leibler (KL) divergence is utilized to minimize the discrepancy between the target heatmaps and the predicted ones, which enables the student network to learn the implicit keypoint relationship. Besides, an unbalanced OKDHP scheme is introduced to customize the student networks with different compression rates. The effectiveness of our approach is demonstrated by extensive experiments on two common benchmark datasets, MPII and COCO.

preprint2022arXiv

Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Control

Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process observations, we propose a hybrid model-based reinforcement learning (RL) to efficiently guide process control. We first create a probabilistic knowledge graph (KG) hybrid model characterizing the risk- and science-based understanding of biomanufacturing process mechanisms and quantifying inherent stochasticity, e.g., batch-to-batch variation. It can capture the key features, including nonlinear reactions, nonstationary dynamics, and partially observed state. This hybrid model can leverage existing mechanistic models and facilitate learning from heterogeneous process data. A computational sampling approach is used to generate posterior samples quantifying model uncertainty. Then, we introduce hybrid model-based Bayesian RL, accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable dynamic decision making. Cell therapy manufacturing examples are used to empirically demonstrate that the proposed framework can outperform the classical deterministic mechanistic model assisted process optimization.

preprint2022arXiv

Perspective: Ultrafast Imaging of Molecular Dynamics Using Ultrafast Low-Frequency Lasers, X-ray Free Electron Laser and Electron Pulses

The requirement of high space-time resolution and brightness is a great challenge for imaging atomic motion and making molecular movies. Important breakthroughs in ultrabright tabletop laser, x-ray and electron sources have enabled the direct imaging of evolving molecular structures in chemical processes. Recent experimental advances in preparing ultrafast laser and electron pulses have equipped molecular imaging with femtosecond time resolution. This Perspective presents an overview of versatile imaging methods of molecular dynamics. High-order harmonic generation imaging and photoelectron diffraction imaging are based on laser-induced ionization and rescattering processes. Coulomb explosion imaging retrieves molecular structural information by detecting the momentum vectors of fragmented ions. Diffraction imaging encodes molecular structural and electronic information in reciprocal space. We also present various applications of these ultrafast imaging methods in resolving laser-induced nuclear and electronic dynamics.

preprint2022arXiv

RETE: Retrieval-Enhanced Temporal Event Forecasting on Unified Query Product Evolutionary Graph

With the increasing demands on e-commerce platforms, a large volume of user action history is emerging. Those enriched action records are vital to understand users' interests and intents. Recently, prior works for user behavior prediction mainly focus on the interactions with product-side information. However, the interactions with search queries, which usually act as a bridge between users and products, are still underinvestigated. In this paper, we explore a new problem named temporal event forecasting, a generalized user behavior prediction task in a unified query product evolutionary graph, to embrace both query and product recommendation in a temporal manner. This setting involves two challenges: (1) the action data for most users is scarce; (2) user preferences are dynamically evolving and shifting over time. To tackle those issues, we propose a novel Retrieval-Enhanced Temporal Event (RETE) forecasting framework. Unlike existing methods that enhance user representations via roughly absorbing information from connected entities in the whole graph, RETE efficiently and dynamically retrieves relevant entities centrally on each user as high-quality subgraphs, preventing the noise propagation from the densely evolutionary graph structures that incorporate abundant search queries. Meanwhile, RETE autoregressively accumulates retrieval-enhanced user representations from each time step, to capture evolutionary patterns for joint query and product prediction. Empirically, extensive experiments on both the public benchmark and four real-world industrial datasets demonstrate the effectiveness of the proposed RETE method.

preprint2022arXiv

Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training

Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text. Despite its recent flourishing, keyphrase generation for non-English languages hasn't been extensively investigated. In this paper, we call attention to a new setting named multilingual keyphrase generation and we contribute two new datasets, EcommerceMKP and AcademicMKP, covering six languages. Technically, we propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages. The retrieval-augmented model leverages keyphrase annotations in English datasets to facilitate generating keyphrases in low-resource languages. Given a non-English passage, a cross-lingual dense passage retrieval module finds relevant English passages. Then the associated English keyphrases serve as external knowledge for keyphrase generation in the current language. Moreover, we develop a retriever-generator iterative training algorithm to mine pseudo parallel passage pairs to strengthen the cross-lingual passage retriever. Comprehensive experiments and ablations show that the proposed approach outperforms all baselines.

preprint2022arXiv

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the attention scores for the few fetched data in digital. The combined in-memory pruning and on-chip recompute of the relevant attention scores enables SPRINT to transform quadratic complexity to a merely linear one. In addition, we identify and leverage a dynamic spatial locality between the adjacent attention operations even after pruning, which eliminates costly yet redundant data fetches. We evaluate our proposed technique on a wide range of state-of-the-art transformer models. On average, SPRINT yields 7.5x speedup and 19.6x energy reduction when total 16KB on-chip memory is used, while virtually on par with iso-accuracy of the baseline models (on average 0.36% degradation).

preprint2022arXiv

Towards Reproducible Evaluations for Flying Drone Controllers in Virtual Environments

Research attention on natural user interfaces (NUIs) for drone flights is rising. Nevertheless, NUIs are highly diversified and are primarily evaluated in different physical environments, which makes the performance of such solutions hard to compare. We propose a virtual environment, namely VRFlightSim, enabling comparative evaluations with enriched drone flight details to address this issue. We first replicated a state-of-the-art (SOTA) interface and designed two tasks (crossing and pointing) in our virtual environment. Then, two user studies with 13 participants demonstrate the necessity of VRFlightSim and further highlight the potential of open-data interface designs.

preprint2021arXiv

Design and Control of a Highly Redundant Rigid-Flexible Coupling Robot to Assist the COVID-19 Oropharyngeal-Swab Sampling

The outbreak of novel coronavirus pneumonia (COVID-19) has caused mortality and morbidity worldwide. Oropharyngeal-swab (OP-swab) sampling is widely used for the diagnosis of COVID-19 in the world. To protect the clinical staff from being infected by the virus, we developed a 9-degree-of-freedom (DOF) rigid-flexible coupling (RFC) robot to assist the COVID-19 OP-swab sampling. This robot is composed of a visual system, UR5 robot arm, micro-pneumatic actuator and force-sensing system. The robot is expected to reduce risk and free up the clinical staff from the long-term repetitive sampling work. Compared with a rigid sampling robot, the developed force-sensing RFC robot can facilitate OP-swab sampling procedures in a safer and softer way. In addition, a varying-parameter zeroing neural network-based optimization method is also proposed for motion planning of the 9-DOF redundant manipulator. The developed robot system is validated by OP-swab sampling on both oral cavity phantoms and volunteers.

preprint2021arXiv

Long Live The Image: Container-Native Data Persistence in Production

Containerization plays a crucial role in the de facto technology stack for implementing microservices architecture (each microservice has its own database in most cases). Nevertheless, there are still fierce debates on containerizing production databases, mainly due to the data persistence issues and concerns. Driven by a project of refactoring an Automated Machine Learning system, this research proposes the container-native data persistence as a conditional solution to running database containers in production. In essence, the proposed solution distinguishes the stateless data access (i.e. reading) from the stateful data processing (i.e. creating, updating, and deleting) in databases. A master database handles the stateful data processing and dumps database copies for building container images, while the database containers will keep stateless at runtime, based on the preloaded dump in the image. Although there are delays in the state/image update propagation, this solution is particularly suitable for the read-only, the eventual consistency, and the asynchronous processing scenarios. Moreover, with optimal tuning (e.g., disabling locking), the portability and performance gains of a read-only database container would outweigh the performance loss in accessing data across the underlying image layers.

preprint2021arXiv

On a Factorial Knowledge Architecture for Data Science-powered Software Engineering

Given the data-intensive and collaborative trend in science, the software engineering community also pays increasing attention to obtaining valuable and useful insights from data repositories. Nevertheless, applying data science to software engineering (e.g., mining software repositories) can be blind and meaningless if it lacks a suitable knowledge architecture (KA). By observing that software engineering practices are generally recorded through a set of factors (e.g., programmer capacity, different environmental conditions, etc.) involved in various software project aspects, we propose a factor-based hierarchical KA of software engineering to help maximize the value of software repositories and inspire future software data-driven studies. In particular, it is the organized factors and their relationships that help guide software engineering knowledge mining, while the mined knowledge will in turn be indexed/managed through the relevant factors and their interactions. This paper explains our idea about the factorial KA and concisely demonstrates a KA component, i.e. the early-version KA of software product engineering. Once fully scoped, this proposed KA will supplement the well-known SWEBOK in terms of both the factor-centric knowledge management and the coverage/implication of potential software engineering knowledge.

preprint2021arXiv

Reconstruction of Quantitative Susceptibility Maps from Phase of Susceptibility Weighted Imaging with Cross-Connected $Ψ$-Net

Quantitative Susceptibility Mapping (QSM) is a new phase-based technique for quantifying magnetic susceptibility. The existing QSM reconstruction methods generally require complicated pre-processing on high-quality phase data. In this work, we propose to explore a new value of the high-pass filtered phase data generated in susceptibility weighted imaging (SWI), and develop an end-to-end Cross-connected $Ψ$-Net (C$Ψ$-Net) to reconstruct QSM directly from these phase data in SWI without additional pre-processing. C$Ψ$-Net adds an intermediate branch in the classical U-Net to form a $Ψ$-like structure. The specially designed dilated interaction block is embedded in each level of this branch to enlarge the receptive fields for capturing more susceptibility information from a wider spatial range of phase images. Moreover, the crossed connections are utilized between branches to implement a multi-resolution feature fusion scheme, which helps C$Ψ$-Net capture rich contextual information for accurate reconstruction. The experimental results on a human dataset show that C$Ψ$-Net achieves superior performance in our task over other QSM reconstruction algorithms.

preprint2021arXiv

Selective quantum Zeno effect of ultracold atom-molecule scattering in dynamic magnetic fields

We demonstrated that final states of ultracold scattering between atom and molecule can be selectively produced using dynamic magnetic fields of multiple frequencies. The mechanism of the dynamic magnetic field control is based on a generalized quantum Zeno effect for the selected scattering channels. In particular, we use an atom-molecule spin flip scattering to show that the transition to the selected final spin projection of the molecule in the inelastic scattering can be suppressed by dynamic modulation of coupling between the Floquet engineered initial and final states.

preprint2021arXiv

Stop Building Castles on a Swamp! The Crisis of Reproducing Automatic Search in Evidence-based Software Engineering

The evidence-based approach has increasingly been employed to synthesize empirical findings from the primary research in software engineering. Nevertheless, the reproducibility of evidence-based software engineering (EBSE) studies seems to be underemphasized. In our investigation into the automatic search of 311 sample studies, more than 50% of the search strings are not reusable; about 87.5% of the search activities (e.g., search field settings) are unrepeatable; and more than 95% of the whole automatic search implementations are unreproducible. Considering that searching is a cornerstone of an EBSE study, we are afraid that the reproducibility of the current secondary research could be worse than we can imagine. By analyzing and reporting the root causes of the aforementioned observations, we urge collaboration and cooperation among all the stakeholders in our community to improve the research reproducibility in EBSE.

preprint2021arXiv

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples with numerous high-stake applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How can we accelerate the training and the scoring of new-coming samples by outlyingness (referred to as prediction throughout the paper) with a large number of unsupervised, heterogeneous OD models? In this study, we propose a modular acceleration system, called SUOD, to address it. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environment), while maintaining performance accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, along with a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm. We open-source SUOD for reproducibility and accessibility.

preprint2021arXiv

Superresolving second-order correlation imaging using synthesized colored noise speckles

We present a novel method to synthesize non-trivial speckles that can enable superresolving second-order correlation imaging. The speckles acquire a unique anti-correlation in the spatial intensity fluctuation by introducing the blue noise spectrum to the input light fields through amplitude modulation. Illuminating objects with the blue noise speckle patterns can lead to a sub-diffraction limit imaging system with a resolution more than three times higher than first-order imaging, which is comparable to the resolving power of ninth order correlation imaging with thermal light. Our method opens a new route towards non-trivial speckle generation by tailoring amplitudes of the input light fields and provides a versatile scheme for constructing superresolving imaging and microscopy systems without invoking complicated higher-order correlations.

preprint2021arXiv

Tri-Hexagonal charge order in kagome metal CsV$_{3}$Sb$_{5}$ revealed by $^{121}$Sb NQR

We report $^{121}$Sb nuclear quadrupole resonance (NQR) measurements on kagome superconductor CsV$_3$Sb$_5$ with $T_{\rm c}=2.5$ K. $^{121}$Sb NQR spectra split after a charge density wave (CDW) transition at $94$ K, which demonstrates a commensurate CDW state. The coexistence of the high temperature phase and the CDW phase between $91$ K and $94$ K manifests that it is a first order phase transition. The CDW order exhibits Tri-Hexagonal deformation with a lateral shift between the adjacent kagome layers, which is consistent with $2 \times 2 \times 2$ superlattice modulation. The superconducting state coexists with CDW order and shows a conventional s-wave behavior in the bulk state.

preprint2020arXiv

A Dynamic Subspace Based BFGS Method for Large Scale Optimization Problem

Large-scale unconstrained optimization is a fundamental and important, yet not well-solved, class of problems in numerical optimization. The main challenge in designing an algorithm is to require only a few storage locations or very inexpensive computations while preserving global convergence. In this work, we propose a novel approach for solving large-scale unconstrained optimization problems by combining the dynamic subspace technique and the BFGS update algorithm. It is clearly demonstrated that our approach has the same rate of convergence in the dynamic subspace as the BFGS and uses less memory than L-BFGS. Further, we give the convergence analysis by constructing the mapping of low-dimensional Euclidean space to the adaptive subspace. We compare our hybrid algorithm with the BFGS and L-BFGS approaches. Experimental results show that our hybrid algorithm offers several significant advantages such as parallel computing, convergence efficiency, and robustness.

preprint2020arXiv

COPOD: Copula-Based Outlier Detection

Outlier detection refers to the identification of rare items that are deviant from the general data distribution. Existing approaches suffer from high computational complexity, low predictive capability, and limited interpretability. As a remedy, we present a novel outlier detection algorithm called COPOD, which is inspired by copulas for modeling multivariate data distribution. COPOD first constructs an empirical copula, and then uses it to predict tail probabilities of each given data point to determine its level of "extremeness". Intuitively, we think of this as calculating an anomalous p-value. This makes COPOD parameter-free, highly interpretable, and computationally efficient. In this work, we make three key contributions: 1) propose a novel, parameter-free outlier detection algorithm with both great performance and interpretability, 2) perform extensive experiments on 30 benchmark datasets to show that COPOD outperforms in most cases and is also one of the fastest algorithms, and 3) release an easy-to-use Python implementation for reproducibility.
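The idea of scoring points by empirical tail probabilities can be sketched as follows. This is a simplified illustration rather than the released COPOD implementation: it uses only per-dimension left- and right-tail empirical CDFs and omits any further correction the paper may apply; the function names `ecdf_tail_probs` and `copula_outlier_scores` and the toy data are assumptions.

```python
import math

def ecdf_tail_probs(column):
    """Left- and right-tail empirical probabilities for each value in a column."""
    n = len(column)
    left = [sum(1 for u in column if u <= v) / n for v in column]
    right = [sum(1 for u in column if u >= v) / n for v in column]
    return left, right

def copula_outlier_scores(rows):
    """Score each row by summed negative log tail probabilities over all dimensions.

    A row that is extreme in the same direction across many dimensions has small
    tail probabilities and therefore a large score.
    """
    n, d = len(rows), len(rows[0])
    left_sum = [0.0] * n
    right_sum = [0.0] * n
    for j in range(d):
        column = [row[j] for row in rows]
        left, right = ecdf_tail_probs(column)
        for i in range(n):
            # Each tail probability is at least 1/n, so the log is always defined.
            left_sum[i] += -math.log(left[i])
            right_sum[i] += -math.log(right[i])
    return [max(l, r) for l, r in zip(left_sum, right_sum)]

# Toy data: five clustered points and one point that is extreme in both dimensions.
rows = [
    [1.0, 2.0],
    [1.1, 1.9],
    [0.9, 2.1],
    [1.05, 2.05],
    [0.95, 1.95],
    [8.0, 9.0],
]
scores = copula_outlier_scores(rows)
most_extreme = scores.index(max(scores))
print(most_extreme, [round(s, 3) for s in scores])
```

No threshold or model hyperparameter is tuned here, which mirrors the parameter-free property described in the abstract; because the scores are rank-based, the magnitude of the extreme values does not matter, only their ordering within each dimension.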

preprint2020arXiv

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Video retrieval is a challenging research topic bridging the vision and language areas and has attracted broad attention in recent years. Previous works have been devoted to representing videos by directly encoding from frame-level features. In fact, videos consist of various and abundant semantic relations to which existing methods pay less attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions. Specifically, we consider frame regions as vertices and construct a fully-connected semantic correlation graph. Then, we perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations. With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed. Finally, the region features are aggregated to form frame-level features for further encoding to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method by achieving state-of-the-art performance due to the powerful semantic reasoning.

preprint2020arXiv

Photoinduced Vibrations Drive Ultrafast Structural Distortion in Lead Halide Perovskite

Organic-inorganic perovskites have shown great promise towards their application in optoelectronics. The success of this class of material is dictated by the complex interplay between various underlying microscopic phenomena. The structural dynamics of organic cations and the inorganic sublattice after photoexcitation is hypothesized to have a direct effect on the material properties, thereby affecting the overall device performance. Here, we use two-dimensional (2D) electronic spectroscopy to reveal impulsively excited vibrational modes of methylammonium (MA) lead iodide perovskite, which drive the structural distortion after photoexcitation. The vibrational analysis of the measured data allows us to directly monitor the time evolution of the librational motion of the MA cation along with the vibrational coherences of inorganic sublattice. Wavelet analysis of the observed vibrational coherences uncovers the interplay between these two types of phonons. It reveals the coherent generation of the librational motion of the MA cation within ~300 fs, which is complemented by the coherent evolution of the skeletal motion of the inorganic sublattice. We have employed time-dependent density functional theory (TDDFT) to study the atomic motion of the MA cation and the inorganic sublattice during the process of photoexcitation. The TDDFT calculations support our experimental observations of the coherent generation of librational motions in the MA cation and highlight the importance of the anharmonic interaction between the MA cation and the inorganic sublattice. Our calculations predict the transfer of the photoinduced vibrational coherence from the MA cation to the inorganic sublattice, which drives the skeleton motion to form a polaronic state leading to long lifetimes of the charge carriers. This work may lead to novel design principles for next generation of solar cell materials.

preprint2020arXiv

Rate Splitting for Multi-Antenna Downlink: Precoder Design and Practical Implementation

Rate splitting (RS) is a potentially powerful and flexible technique for multi-antenna downlink transmission. In this paper, we address several technical challenges towards its practical implementation for beyond 5G systems. To this end, we focus on a single-cell system with a multi-antenna base station (BS) and K single-antenna receivers. We consider RS in its most general form, and joint decoding to fully exploit the potential of RS. First, we investigate the achievable rates under joint decoding and formulate the precoder design problems to maximize a general utility function, or to minimize the transmit power under pre-defined rate targets. Building upon the concave-convex procedure (CCCP), we propose precoder design algorithms for an arbitrary number of users. Our proposed algorithms approximate the intractable non-convex problems with a number of successively refined convex problems, and provably converge to stationary points of the original problems. Then, to reduce the decoding complexity, we consider the optimization of the precoder and the decoding order under successive decoding. Further, we propose a stream selection algorithm to reduce the number of precoded signals. With a reduced number of streams and successive decoding at the receivers, our proposed algorithm can even be implemented when the number of users is relatively large, whereas the complexity was previously considered as prohibitively high in the same setting. Finally, we propose a simple adaptation of our algorithms to account for the imperfection of the channel state information at the transmitter. Numerical results demonstrate that the general RS scheme provides a substantial performance gain as compared to state-of-the-art linear precoding schemes, especially with a moderately large number of users.

preprint2020arXiv

Research on Annotation Rules and Recognition Algorithm Based on Phrase Window

At present, most Natural Language Processing technology is based on the results of Word Segmentation for Dependency Parsing, which mainly uses an end-to-end method based on supervised learning. There are two main problems with this method: firstly, the labeling rules are complex and the data is too difficult to label, the workload of which is large; secondly, the algorithm cannot recognize the multi-granularity and diversity of language components. In order to solve these two problems, we propose labeling rules based on phrase windows, and design corresponding phrase recognition algorithms. The labeling rule uses phrases as the minimum unit, divides sentences into 7 types of nestable phrase types, and marks the grammatical dependencies between phrases. The corresponding algorithm, drawing on the idea of identifying the target area in the image field, can find the start and end positions of various phrases in the sentence, and realize the synchronous recognition of nested phrases and grammatical dependencies. The results of the experiment show that the labeling rule is convenient and easy to use, and there is no ambiguity; the algorithm is more grammatically multi-granular and diverse than the end-to-end algorithm. Experiments on the CPWD dataset improve the accuracy of the end-to-end method by about 1 point. The corresponding method was applied to the CCL2018 competition, and won first place in the Chinese Metaphor Sentiment Analysis Task.

preprint2020arXiv

Research on multi-dimensional end-to-end phrase recognition algorithm based on background knowledge

At present, the deep end-to-end method based on supervised learning is used in entity recognition and dependency analysis. There are two problems with this method: firstly, background knowledge cannot be introduced; secondly, the multi-granularity and nested features of natural language cannot be recognized. In order to solve these problems, annotation rules based on phrase windows are proposed, and a corresponding multi-dimensional end-to-end phrase recognition algorithm is designed. The annotation rule divides sentences into seven types of nested phrases and indicates the dependencies between phrases. The algorithm can not only introduce background knowledge and recognize all kinds of nested phrases in sentences, but also recognize the dependencies between phrases. The experimental results show that the annotation rule is easy to use and has no ambiguity; the matching algorithm is more consistent with the multi-granularity and diversity characteristics of syntax than the traditional end-to-end algorithm. In experiments on the CPWD dataset, by introducing background knowledge, the new algorithm improves the accuracy of the end-to-end method by more than one point. The corresponding method was applied to the CCL 2018 competition and won first place in the task of Chinese humor type recognition.

preprint2020arXiv

SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

A synthetic dataset is a data object that is generated programmatically, and it may be valuable for creating a single dataset from multiple sources when direct collection is difficult or costly. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this paper, we study a specific synthetic data generation task called downscaling, a procedure to infer high-resolution, harder-to-collect information (e.g., individual level records) from many low-resolution, easy-to-collect sources, and propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). For given low-resolution datasets, the central idea of SYNC is to fit Gaussian copula models to each of the low-resolution datasets in order to correctly capture dependencies and marginal distributions, and then sample from the fitted models to obtain the desired high-resolution subsets. Predictive models are then used to merge sampled subsets into one, and finally, sampled datasets are scaled according to low-resolution marginal constraints. We make four key contributions in this work: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) perform simulation studies to validate SYNC's performance as a synthetic data generation algorithm, 3) demonstrate its value as a feature engineering tool, as well as an alternative to data collection in situations where gathering is difficult, through two real-world datasets, 4) release an easy-to-use framework implementation for reproducibility and scalability at the production level that easily incorporates new data.
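
The fit-then-sample step that the abstract describes can be sketched as follows. This is a minimal standard-library illustration of a Gaussian copula with empirical marginals, not the SYNC implementation: the function names, the rank-based normal scores, and the empirical-quantile marginals are assumptions, and the later merging and marginal-scaling stages are not covered.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def normal_scores(column):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD.inv_cdf(rank / (n + 1))
    return scores


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def cholesky(m):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = (m[i][i] - s) ** 0.5
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def fit_gaussian_copula(columns):
    """Estimate the copula correlation and keep sorted columns as marginals."""
    scores = [normal_scores(c) for c in columns]
    d = len(columns)
    corr = [[1.0 if i == j else correlation(scores[i], scores[j])
             for j in range(d)] for i in range(d)]
    return {"chol": cholesky(corr), "marginals": [sorted(c) for c in columns]}


def sample_copula(model, n, rng):
    """Draw n synthetic rows: correlated normals -> uniforms -> empirical quantiles."""
    low, marginals = model["chol"], model["marginals"]
    d = len(low)
    rows = []
    for _ in range(n):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [sum(low[i][k] * eps[k] for k in range(i + 1)) for i in range(d)]
        row = []
        for i in range(d):
            u = _STD.cdf(z[i])
            idx = min(int(u * len(marginals[i])), len(marginals[i]) - 1)
            row.append(marginals[i][idx])
        rows.append(row)
    return rows
```

A usage example would pass a list of equally long numeric columns to `fit_gaussian_copula` and then call `sample_copula(model, 500, random.Random(1))`. Sampled values are drawn from the observed values of each column, so each marginal is preserved approximately while the dependence comes from the fitted correlation.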