Source author record

Zheng Li

Zheng Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Catalog footprint

What is connected

80 works

41 topics

4 close collaborators


preprint 2026 arXiv

A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

preprint 2026 arXiv

Beyond Known Fakes: Generalized Detection of AI-Generated Images via Post-hoc Distribution Alignment

The rapid proliferation of highly realistic AI-generated images poses serious security threats such as misinformation and identity fraud. Detecting generated images in open-world settings is particularly challenging when they originate from unknown generators, as existing methods typically rely on model-specific artifacts and require retraining on new fake data, limiting their generalization and scalability. In this work, we propose Post-hoc Distribution Alignment (PDA), a generalized and model-agnostic framework for detecting AI-generated images under unknown generative threats. Specifically, PDA reformulates detection as a distribution alignment task by regenerating test images through a known generative model. When real images are regenerated, they inherit model-specific artifacts and align with the known fake distribution. In contrast, regenerated unknown fakes contain incompatible or mixed artifacts and remain misaligned. This difference allows an existing detector, trained on the known generative model, to accurately distinguish real images from unknown fakes without requiring access to unseen data or retraining. Extensive experiments across 16 state-of-the-art generative models, including GANs, diffusion models, and commercial text-to-image APIs (e.g., Midjourney), demonstrate that PDA achieves average detection accuracy of 96.69%, outperforming the best baseline by 10.71%. Comprehensive ablation studies and robustness analyses further confirm PDA's generalizability and resilience to distribution shifts and image transformations. Overall, our work provides a practical and scalable solution for real-world AI-generated image detection where new generative models emerge continuously.

preprint 2026 arXiv

Field-induced magnetic phase transitions and transport anomalies in GdAlSi

Magnetic topological materials hosting non-zero Berry curvature have emerged as a focus of intensive research due to their exceptional magnetoelectric coupling phenomena and potential applications in next-generation spintronic devices. In this work, we successfully synthesized high-quality single crystals of GdAlSi, a prototypical member of the RAlX (R = rare earth elements; X = Si/Ge) family that has been theoretically predicted to sustain a non-trivial Weyl semimetal state. Through systematic investigations of magnetic and transport properties, we identified two successive antiferromagnetic transitions at critical temperatures TN1 (31.9 K) and TN2 (31.1 K), as evidenced by temperature-dependent resistivity, magnetic susceptibility, and specific heat measurements. Notably, applied magnetic fields exceeding 8 T induce a third magnetic transition (TN3), generating a cascade of metamagnetic transitions that collectively form a dendritic phase diagram. This complex magnetic behavior is attributed to the interplay between localized Gd-4f moments and itinerant conduction electrons, possibly mediated by Dzyaloshinskii-Moriya interactions. Transport measurements revealed striking stepwise anomalies in magnetoresistance when crossing phase boundaries, accompanied by pronounced hysteresis loops arising from magnetic moment flopping processes. Our results not only establish GdAlSi as a rich platform for investigating correlated topological states, but also demonstrate its potential for engineering topological phase transitions through magnetic symmetry manipulation in Weyl semimetals.

preprint 2026 arXiv

Pressure-Free Surface-Induced Flow by Geometric Rectification

Pressure-driven flow collapses when confined ($u\propto r^{2}$). Asymmetry rectifies surface activity (exchange or slip gradients) into axial flux at $ΔP=0$ despite zero net exchange. Lorentz reciprocity yields a projection law: throughput is the inner product of source with a geometry kernel. Signatures include inverted ``narrower-is-faster'' scaling ($u\propto r^{-1}$), leading-order viscosity independence, length amplification ($Q\propto L$), and linear superposition, defining surface-induced flow as a pressure-free Stokes-transport mode from microfluidics to physiology.

preprint 2026 arXiv

VidLeaks: Membership Inference Attacks Against Text-to-Video Models

The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.

preprint 2025 arXiv

Anomalous Hall effect and rich magnetic phase diagram of Mn$_{100-x}$Rh$_{x}$ epitaxial films

A series of Mn$_{100-x}$Rh$_x$ ($20 \le x \le 50$) thin films were epitaxially grown on the MgO substrate using magnetron sputtering technique, and were systematically investigated by magnetization, longitudinal electrical resistivity, and transverse Hall resistivity. After optimizing the growth conditions, phase-pure Mn$_{100-x}$Rh$_x$ films with a cubic CsCl-type structure were obtained, and their magnetic phase diagram was built. The manipulation of Rh content leads to a rich magnetic phase diagram, where three different regimes can be identified: for $x < 40$, Mn$_{100-x}$Rh$_x$ films undergo a ferromagnetic (FM) transition below $T_\mathrm{C} \approx$ 330-350 K; for $40 \le x \le 45$, in addition to the FM transition at $T_\mathrm{C} \approx$ 200 K, Mn$_{100-x}$Rh$_x$ films undergo a FM-to-antiferromagnetic (AFM) transition at $T_\mathrm{N} \approx$ 120 K; finally for $x > 45$, only one AFM transition at $T_\mathrm{N} \approx$ 150 K can be tracked. All the Mn$_{100-x}$Rh$_x$ films exhibit distinct anomalous Hall effect in their magnetically ordered state, which is most likely due to the intrinsic Berry-curvature mechanism. In addition, all the anomalous Hall transport properties, including the resistivity, conductivity, and angle exhibit a strong correlation with the magnetic properties of Mn$_{100-x}$Rh$_x$ films, which become most evident for $x$ = 35. Our systematic investigations suggest a strong correlation between magnetic properties and electronic band topology in Mn$_{100-x}$Rh$_x$ films, highlighting their great potential for AFM spintronics.

preprint 2025 arXiv

HY-MT1.5 Technical Report

In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro. While it marginally trails Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

preprint 2025 arXiv

Kinetic Catalysis of Spontaneous Knotting: How Free Particles Modulate Filament Entanglement

Entangled knots form spontaneously in flexible filaments, yet the influence of the surrounding environment on this process is poorly understood. Here we demonstrate that free-moving particles act as kinetic catalysts for spontaneous knotting. Through controlled agitation experiments, we find that a small number of inert beads substantially enhance the probability and accelerate the rate of knot formation. This catalytic effect is non-monotonic: there is an optimal particle size and concentration that maximizes entanglement, while an excess of particles suppresses knotting by impeding the filament's dynamics. We develop a stochastic model that quantitatively reproduces this behavior, attributing it to a competition between entanglement-promoting collisions and motion-suppressing drag. Our findings reveal a mechanism for tuning topological complexity, whereby adjusting these environmental agitators can either promote rapid self-assembly or inhibit unwanted entanglement. This work suggests new strategies for controlling filament topology in settings ranging from crowded biological environments to advanced materials processing.

preprint 2024 arXiv

Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review

Metaphor as an advanced cognitive modality works by extracting familiar concepts in the target domain in order to understand vague and abstract concepts in the source domain. This helps humans to quickly understand and master new domains and thus adapt to changing environments. With the continuous development of metaphor research in the natural language community, many studies using knowledge-assisted models to detect textual metaphors have emerged in recent years. Compared to not using knowledge, systems that introduce various kinds of knowledge achieve greater performance gains and reach SOTA in a recent study. Based on this, the goal of this paper is to provide a comprehensive review of research advances in the application of deep learning for knowledge injection in metaphor detection tasks. We will first systematically summarize and generalize the mainstream knowledge and knowledge injection principles. Then, the datasets, evaluation metrics, and benchmark models used in metaphor detection tasks are examined. Finally, we explore the current issues facing knowledge injection methods and provide an outlook on future research directions.

preprint 2023 arXiv

Backdoor Attacks Against Dataset Distillation

Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.

preprint 2023 arXiv

DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the ``person'' topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of the counterpart solutions, like ours, against the rapidly-evolving fake image generation.

preprint 2023 arXiv

Ultrafast X-ray Diffraction Probe of Coherent Spin-state Dynamics in Molecules

We propose an approach to probe coherent spin-state dynamics of molecules using circularly polarized hard x-ray pulses. For the dynamically aligned nitric oxide molecules in a coherent superposition spin-orbit coupled electronic state that can be prepared through stimulated Raman scattering, we demonstrate the capability of ultrafast x-ray diffraction to not only reveal the quantum beating of the coherent spin-state wave packet, but also image the spatial spin density of the molecule. With circularly polarized ultrafast x-ray diffraction signal, we show that the electronic density matrix can be retrieved. The spatio-temporal resolving power of ultrafast x-ray diffraction paves the way for tracking transient spatial wave function in molecular dynamics involving spin degree of freedom.

preprint 2022 arXiv

Accelerating Attention through Gradient-Based Learned Runtime Pruning

Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9x and 3.9x speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (<0.2% degradation).
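
The idea of pruning low-scoring attention entries at runtime can be illustrated with a minimal NumPy sketch. It is not the LeOPArd formulation or hardware design: the function names, the rule of always keeping each row's top score, and the sigmoid gate used as a differentiable stand-in for the hard threshold are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pruned_attention(Q, K, V, threshold):
    """Scaled dot-product attention that skips scores below `threshold`."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    keep = scores >= threshold
    # Assumption: always keep the largest score per query so no row is empty.
    keep[np.arange(scores.shape[0]), scores.argmax(axis=1)] = True
    masked = np.where(keep, scores, -np.inf)
    return softmax(masked) @ V, keep

def soft_keep_gate(scores, threshold, temperature=0.1):
    # Hypothetical smooth surrogate for the hard mask (scores >= threshold),
    # so that a threshold could in principle be learned by gradient descent.
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))
```

The boolean `keep` mask shows which score computations would still be needed after pruning; with a threshold of negative infinity the function reduces to ordinary dense attention.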

preprint 2022 arXiv

Anomalous thermal Hall effect and anomalous Nernst effect of CsV$_{3}$Sb$_{5}$

Motivated by time-reversal symmetry breaking and the giant anomalous Hall effect in the kagome superconductors \textit{A}V$_3$Sb$_5$ (\textit{A} = Cs, K, Rb), we carried out thermal transport measurements on CsV$_3$Sb$_5$. In addition to the anomalous Hall effect, the anomalous Nernst effect and the anomalous thermal Hall effect emerge. Interestingly, the longitudinal thermal conductivity $κ_{xx}$ largely deviates from the electronic contribution obtained from the longitudinal conductivity $σ_{xx}$ by the Wiedemann-Franz law. In contrast, the thermal Hall conductivity $κ_{xy}$ is roughly consistent with the electronic contribution given by the Wiedemann-Franz law. All these results indicate a large phonon contribution to the longitudinal thermal conductivity. Moreover, the thermal Hall conductivity is also slightly greater than the theoretical electronic contribution, indicating other charge-neutral contributions. Furthermore, the Nernst coefficient and Hall resistivity show multi-band behavior with a possible additional contribution from Berry curvature at low fields.
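
For reference, the Wiedemann-Franz comparison used above can be written out explicitly. This is the standard textbook form of the law rather than an equation taken from the paper, and the superscript "el" (electronic contribution) is notation introduced here:

```latex
\frac{\kappa^{\mathrm{el}}_{xx}}{\sigma_{xx}} = L_0 T,
\qquad
\frac{\kappa^{\mathrm{el}}_{xy}}{\sigma_{xy}} = L_0 T,
\qquad
L_0 = \frac{\pi^2}{3}\left(\frac{k_\mathrm{B}}{e}\right)^2
    \approx 2.44 \times 10^{-8}\ \mathrm{W\,\Omega\,K^{-2}}
```

A measured $κ_{xx}$ well above $L_0 T σ_{xx}$ then points to heat carried by something other than electrons, which is the reasoning behind the phonon interpretation in the abstract.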

preprint 2022 arXiv

Auditing Membership Leakages of Multi-Exit Networks

Relying on the fact that not all inputs require the same amount of computation to yield a confident prediction, multi-exit networks are gaining attention as a prominent approach for pushing the limits of efficient deployment. Multi-exit networks endow a backbone model with early exits, allowing predictions to be obtained at intermediate layers of the model and thus saving computation time and/or energy. However, current designs of multi-exit networks only aim to achieve the best trade-off between resource usage efficiency and prediction accuracy; the privacy risks stemming from them have never been explored. This prompts the need for a comprehensive investigation of privacy risks in multi-exit networks. In this paper, we perform the first privacy analysis of multi-exit networks through the lens of membership leakages. In particular, we first leverage the existing attack methodologies to quantify the multi-exit networks' vulnerability to membership leakages. Our experimental results show that multi-exit networks are less vulnerable to membership leakages and the exit (number and depth) attached to the backbone model is highly correlated with the attack performance. Furthermore, we propose a hybrid attack that exploits the exit information to improve the performance of existing attacks. We evaluate the membership leakage threat caused by our hybrid attack under three different adversarial setups, ultimately arriving at a model-free and data-free adversary. These results clearly demonstrate that our hybrid attacks are very broadly applicable, and thus the corresponding risks are much more severe than shown by existing membership inference attacks. We further present a defense mechanism called TimeGuard specifically for multi-exit networks and show that TimeGuard mitigates the newly proposed attacks perfectly.

preprint 2022 arXiv

Condensing Graphs via One-Step Gradient Matching

As training deep learning models on large datasets takes a lot of time and resources, it is desirable to construct a small synthetic dataset with which we can train deep learning models sufficiently. Recent works have explored solutions for condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large-real data and small-synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance and our method is significantly faster than multi-step gradient matching (e.g. 15x in CIFAR10 for synthesizing 500 graphs). Code is available at https://github.com/amazon-research/DosCond.

preprint 2022 arXiv

DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization

Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. However, such models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. To alleviate this issue, we propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model. Empirical analyses show that, despite the challenging nature of generative tasks, we were able to achieve a 16.5x model footprint compression ratio with little performance drop relative to the full-precision counterparts on multiple summarization and QA datasets. We further pushed the limit of compression ratio to 27.7x and presented the performance-efficiency trade-off for generative tasks using pre-trained models. To the best of our knowledge, this is the first work aiming to effectively distill and quantize sequence-to-sequence pre-trained models for language generation tasks.

preprint 2022 arXiv

ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions

Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
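
The three steps described above (per-dimension empirical CDF, tail probabilities, aggregation) can be sketched in a few lines of NumPy. This is a simplified illustration, not the released ECOD implementation: the function name is hypothetical, and the aggregation here simply takes the larger of the summed left-tail and right-tail negative log probabilities, which is an assumption about one reasonable way to combine them.

```python
import numpy as np

def ecdf_tail_scores(X):
    """Simplified ECDF-based outlier scores; larger means more outlying."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    left = np.empty_like(X)   # P(feature <= x), estimated per dimension
    right = np.empty_like(X)  # P(feature >= x), estimated per dimension
    for j in range(d):
        col = X[:, j]
        sorted_col = np.sort(col)
        left[:, j] = np.searchsorted(sorted_col, col, side="right") / n
        right[:, j] = (n - np.searchsorted(sorted_col, col, side="left")) / n
    # Each probability is at least 1/n, so the logarithms are finite.
    left_score = -np.log(left).sum(axis=1)
    right_score = -np.log(right).sum(axis=1)
    # Assumption: score a point by whichever tail makes it look rarer.
    return np.maximum(left_score, right_score)
```

Calling `ecdf_tail_scores(X)` on an `(n, d)` array returns one score per row; the sketch has no hyperparameters to tune, which mirrors the parameter-free property the abstract emphasizes.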

preprint 2022 arXiv

Indexing Metric Spaces for Exact Similarity Search

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.
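
The role of the triangle inequality in pruning can be shown with a minimal single-pivot range query. The sketch is illustrative only and does not correspond to any specific index from the survey; the class name `PivotTable` and its methods are hypothetical, and Euclidean distance is used just as an example of a metric.

```python
import math

def euclidean(a, b):
    # Any function satisfying the metric axioms could be substituted here.
    return math.dist(a, b)

class PivotTable:
    """Toy single-pivot index: stores each object's distance to one pivot."""

    def __init__(self, points, pivot, dist=euclidean):
        self.points = list(points)
        self.pivot = pivot
        self.dist = dist
        self.pivot_dists = [dist(p, pivot) for p in self.points]

    def range_query(self, q, radius):
        dq = self.dist(q, self.pivot)
        results, computed = [], 0
        for p, dp in zip(self.points, self.pivot_dists):
            # Triangle inequality: |d(q, pivot) - d(p, pivot)| <= d(q, p),
            # so p cannot lie within `radius` of q if the bound exceeds it.
            if abs(dq - dp) > radius:
                continue  # pruned without computing d(q, p)
            computed += 1
            if self.dist(q, p) <= radius:  # validation step
                results.append(p)
        return results, computed
```

The second return value counts the distance computations that survive pruning; since the query is exact, the result set always matches a brute-force scan.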

preprint 2022 arXiv

Just Enough, Just in Time, Just for "Me": Fundamental Principles for Engineering IoT-native Software Systems

By seamlessly integrating everyday objects and by changing the way we interact with our surroundings, the Internet of Things (IoT) is drastically improving the life quality of households and enhancing the productivity of businesses. Given the unique IoT characteristics, IoT applications have emerged distinctively from the mainstream application types. Inspired by the outlook of a programmable world, we further foresee an IoT-native trend in designing, developing, deploying, and maintaining software systems. However, although the challenges of IoT software projects are frequently discussed, addressing those challenges is still in the "crossing the chasm" period. By participating in several IoT projects, we gradually distilled three fundamental principles for engineering IoT-native software systems, namely just enough, just in time, and just for "me". These principles target the challenges that are associated with the most typical features of IoT environments, ranging from resource limits to technology heterogeneity of IoT devices. We expect this research to trigger dedicated efforts, techniques and theories for the topic of IoT-native software engineering.

preprint 2022 arXiv

Membership-Doctor: Comprehensive Assessment of Membership Inference Against Machine Learning Models

Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to infer whether an input sample was used to train the model. Over the past few years, researchers have produced many membership inference attacks and defenses. However, these attacks and defenses employ a variety of strategies and are conducted in different models and datasets. The lack of a comprehensive benchmark means we do not understand the strengths and weaknesses of existing attacks and defenses. We fill this gap by presenting a large-scale measurement of different membership inference attacks and defenses. We systematize membership inference through the study of nine attacks and six defenses and measure the performance of different attacks and defenses in a holistic evaluation. We then quantify the impact of the threat model on the results of these attacks. We find that some assumptions of the threat model, such as same-architecture and same-distribution between shadow and target models, are unnecessary. We are also the first to execute attacks on real-world data collected from the Internet, instead of laboratory datasets. We further investigate what determines the performance of membership inference attacks and reveal that the commonly believed overfitting level is not sufficient for the success of the attacks. Instead, the Jensen-Shannon distance of entropy/cross-entropy between member and non-member samples correlates with attack performance much better. This gives us a new way to accurately predict membership inference risks without running the attack. Finally, we find that data augmentation degrades the performance of existing attacks to a larger extent, and we propose an adaptive attack that uses augmentation to train shadow and attack models and improves attack performance.
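
The Jensen-Shannon distance between member and non-member entropy distributions can be estimated roughly as follows. This is a hedged sketch, not the paper's measurement code: the function names, the histogram-based density estimate, and the default of 20 bins are assumptions, and the inputs are taken to be arrays of per-sample class probabilities produced by the target model.

```python
import numpy as np

def prediction_entropy(probs):
    # Shannon entropy (natural log) of each row of class probabilities.
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)

def js_distance(p, q):
    """Jensen-Shannon distance (base 2, in [0, 1]) of two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

def entropy_js_distance(member_probs, nonmember_probs, bins=20):
    # Assumption: compare the two entropy samples on shared histogram bins.
    h_mem = prediction_entropy(member_probs)
    h_non = prediction_entropy(nonmember_probs)
    hi = max(h_mem.max(), h_non.max())
    edges = np.linspace(0.0, hi if hi > 0 else 1.0, bins + 1)
    p, _ = np.histogram(h_mem, bins=edges)
    q, _ = np.histogram(h_non, bins=edges)
    return js_distance(p, q)
```

A value near 0 means member and non-member predictions look alike in entropy, while a value near 1 means they are almost perfectly separable, which per the abstract is the regime where attacks tend to succeed.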

preprint2022arXiv

Multilingual Knowledge Graph Completion with Self-Supervised Adaptive Graph Alignment

Predicting missing facts in a knowledge graph (KG) is crucial as modern KGs are far from complete. Due to labor-intensive human labeling, this phenomenon deteriorates when handling knowledge represented in various languages. In this paper, we explore multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages. However, language alignment used in prior works is still not fully exploited: (1) alignment pairs are treated equally to maximally push parallel entities to be close, which ignores KG capacity inconsistency; (2) seed alignment is scarce and new alignment identification is usually in a noisily unsupervised manner. To tackle these issues, we propose a novel self-supervised adaptive graph alignment (SS-AGA) method. Specifically, SS-AGA fuses all KGs as a whole graph by regarding alignment as a new edge type. As such, information propagation and noise influence across KGs can be adaptively controlled via relation-aware attention weights. Meanwhile, SS-AGA features a new pair generator that dynamically captures potential alignment pairs in a self-supervised paradigm. Extensive experiments on both the public multilingual DBPedia KG and newly-created industrial multilingual E-commerce KG empirically demonstrate the effectiveness of SS-AG

preprint2022arXiv

Neutrino Rocket Jet Model: An Explanation of High-velocity Pulsars and their Spin-down Evolution

The fact that the spatial velocity of pulsars is generally higher than that of their progenitor stars has bothered astronomers for nearly 50 years. It has been extensively argued that the high pulsar velocity should be acquired during a natal kick process on a timescale of 100ms - 10s in the supernova explosion, in which some asymmetrical dynamical mechanism plays a key role. However, a satisfactory picture generally is still lacking. In this study, it is argued that the neutrino rocket model can well account for the high speed as well as the long-term evolution behaviors of pulsars. The neutrinos are emitted from superfluid vortex neutrons through the neutrino cyclotron radiation mechanism. The unique characters of left-handed neutrinos and right-handed antineutrinos resulting from the nonconservation of parity in weak interactions play a major role in the spatial asymmetry. The continuous acceleration of pulsars can be naturally explained by this model, which yields a maximum velocity surpassing 1000 km s$^{-1}$. The alignment between the spinning axis and the direction of motion observed for the Crab pulsar (PSR 0531) and the Vela pulsar (PSR 0833) can be well accounted for. The observed correlation between the spin-down rate and the period of long-period pulsars with $P \gtrsim 0.5$s can also be satisfactorily explained.

preprint2022arXiv

Online Knowledge Distillation for Efficient Pose Estimation

Existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. One promising technique to obtain an accurate yet lightweight pose estimator is knowledge distillation, which distills the pose knowledge from a powerful teacher model to a less-parameterized student model. However, existing pose distillation works rely on a heavy pre-trained estimator to perform knowledge transfer and require a complex two-stage learning procedure. In this work, we investigate a novel Online Knowledge Distillation framework, termed OKDHP, which distills Human Pose structure knowledge in a one-stage manner to guarantee distillation efficiency. Specifically, OKDHP trains a single multi-branch network and acquires the predicted heatmaps from each branch, which are then assembled by a Feature Aggregation Unit (FAU) as the target heatmaps to teach each branch in reverse. Instead of simply averaging the heatmaps, the FAU, which consists of multiple parallel transformations with different receptive fields, leverages multi-scale information and thus obtains higher-quality target heatmaps. The pixel-wise Kullback-Leibler (KL) divergence is utilized to minimize the discrepancy between the target heatmaps and the predicted ones, which enables the student network to learn the implicit keypoint relationship. In addition, an unbalanced OKDHP scheme is introduced to customize the student networks with different compression rates. The effectiveness of our approach is demonstrated by extensive experiments on two common benchmark datasets, MPII and COCO.
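
The distillation loss described in the abstract above can be sketched roughly as follows. Two assumptions are made here that the abstract does not confirm: each heatmap is normalized with a softmax over its pixels before the KL divergence is taken, and the ensemble target is a plain average of the branch heatmaps. The plain average is only a stand-in; it is the very baseline the FAU is said to improve on. All function names are hypothetical.

```python
import math

def spatial_softmax(heatmap):
    """Turn a 2D heatmap of raw scores into a probability map over pixels."""
    flat = [v for row in heatmap for v in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted):
    """KL(target || predicted) summed over pixels; inputs are probability lists."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def average_heatmaps(heatmaps):
    """Element-wise mean of same-sized heatmaps (simple stand-in for the FAU)."""
    n = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(h[r][c] for h in heatmaps) / n for c in range(cols)]
            for r in range(rows)]

# Toy 2x2 heatmaps predicted by two branches for one keypoint.
branch_a = [[0.1, 0.2], [0.3, 2.0]]
branch_b = [[0.0, 0.1], [0.5, 1.5]]

# The aggregated heatmap teaches every branch "in reverse".
target = spatial_softmax(average_heatmaps([branch_a, branch_b]))
losses = [kl_divergence(target, spatial_softmax(b)) for b in (branch_a, branch_b)]
```

Each entry of `losses` is the per-branch distillation term; in training it would be added to the usual supervised heatmap loss.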

preprint2022arXiv

Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Control

Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process observations, we propose a hybrid model-based reinforcement learning (RL) to efficiently guide process control. We first create a probabilistic knowledge graph (KG) hybrid model characterizing the risk- and science-based understanding of biomanufacturing process mechanisms and quantifying inherent stochasticity, e.g., batch-to-batch variation. It can capture the key features, including nonlinear reactions, nonstationary dynamics, and partially observed state. This hybrid model can leverage existing mechanistic models and facilitate learning from heterogeneous process data. A computational sampling approach is used to generate posterior samples quantifying model uncertainty. Then, we introduce hybrid model-based Bayesian RL, accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable dynamic decision making. Cell therapy manufacturing examples are used to empirically demonstrate that the proposed framework can outperform the classical deterministic mechanistic model assisted process optimization.

preprint2022arXiv

Perspective: Ultrafast Imaging of Molecular Dynamics Using Ultrafast Low-Frequency Lasers, X-ray Free Electron Laser and Electron Pulses

The requirement of high space-time resolution and brightness is a great challenge for imaging atomic motion and making molecular movies. Important breakthroughs in ultrabright tabletop laser, x-ray and electron sources have enabled the direct imaging of evolving molecular structures in chemical processes. Recent experimental advances in preparing ultrafast laser and electron pulses have equipped molecular imaging with femtosecond time resolution. This Perspective presents an overview of versatile imaging methods of molecular dynamics. High-order harmonic generation imaging and photoelectron diffraction imaging are based on laser-induced ionization and rescattering processes. Coulomb explosion imaging retrieves molecular structural information by detecting the momentum vectors of fragmented ions. Diffraction imaging encodes molecular structural and electronic information in reciprocal space. We also present various applications of these ultrafast imaging methods in resolving laser-induced nuclear and electronic dynamics.

preprint2022arXiv

RETE: Retrieval-Enhanced Temporal Event Forecasting on Unified Query Product Evolutionary Graph

With the increasing demands on e-commerce platforms, a large volume of user action history is emerging. These rich action records are vital for understanding users' interests and intents. Prior works on user behavior prediction mainly focus on interactions with product-side information. However, interactions with search queries, which usually act as a bridge between users and products, are still underinvestigated. In this paper, we explore a new problem named temporal event forecasting, a generalized user behavior prediction task in a unified query product evolutionary graph, to embrace both query and product recommendation in a temporal manner. This setting involves two challenges: (1) the action data for most users is scarce; (2) user preferences are dynamically evolving and shifting over time. To tackle these issues, we propose a novel Retrieval-Enhanced Temporal Event (RETE) forecasting framework. Unlike existing methods that enhance user representations by roughly absorbing information from connected entities in the whole graph, RETE efficiently and dynamically retrieves relevant entities centered on each user as high-quality subgraphs, preventing noise propagation from the dense evolutionary graph structures that incorporate abundant search queries. Meanwhile, RETE autoregressively accumulates retrieval-enhanced user representations from each time step to capture evolutionary patterns for joint query and product prediction. Empirically, extensive experiments on both the public benchmark and four real-world industrial datasets demonstrate the effectiveness of the proposed RETE method.

preprint2022arXiv

Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training

Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text. Despite its recent flourishing, keyphrase generation for non-English languages has not been extensively investigated. In this paper, we call attention to a new setting named multilingual keyphrase generation and we contribute two new datasets, EcommerceMKP and AcademicMKP, covering six languages. Technically, we propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages. The retrieval-augmented model leverages keyphrase annotations in English datasets to facilitate generating keyphrases in low-resource languages. Given a non-English passage, a cross-lingual dense passage retrieval module finds relevant English passages. Then the associated English keyphrases serve as external knowledge for keyphrase generation in the current language. Moreover, we develop a retriever-generator iterative training algorithm to mine pseudo parallel passage pairs to strengthen the cross-lingual passage retriever. Comprehensive experiments and ablations show that the proposed approach outperforms all baselines.

preprint2022arXiv

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the attention scores for the few fetched data in digital. The combined in-memory pruning and on-chip recompute of the relevant attention scores enables SPRINT to transform quadratic complexity to a merely linear one. In addition, we identify and leverage a dynamic spatial locality between the adjacent attention operations even after pruning, which eliminates costly yet redundant data fetches. We evaluate our proposed technique on a wide range of state-of-the-art transformer models. On average, SPRINT yields 7.5x speedup and 19.6x energy reduction when total 16KB on-chip memory is used, while virtually on par with iso-accuracy of the baseline models (on average 0.36% degradation).
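
The prune-then-recompute idea in the abstract above can be sketched in software, although the paper describes a hardware accelerator. In this sketch, coarse quantization is an assumed stand-in for the approximate in-memory (ReRAM) score computation, and every name, threshold, and vector below is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse(vector, step=0.5):
    """Crude quantization, standing in for the approximate analog computation."""
    return [round(x / step) * step for x in vector]

def pruned_attention(query, keys, values, threshold):
    """Attend over only the keys whose approximate score passes the threshold."""
    # Step 1: cheap approximate scores decide which keys survive pruning.
    q_approx = coarse(query)
    approx = [dot(q_approx, coarse(k)) for k in keys]
    keep = [i for i, s in enumerate(approx) if s >= threshold]
    if not keep:
        # Always keep at least the best approximate match.
        keep = [max(range(len(keys)), key=lambda i: approx[i])]

    # Step 2: exact scores are recomputed only for the surviving keys.
    exact = [dot(query, keys[i]) for i in keep]
    peak = max(exact)
    weights = [math.exp(s - peak) for s in exact]
    total = sum(weights)

    # Step 3: softmax-weighted sum over the surviving values.
    dim = len(values[0])
    output = [sum(w / total * values[i][d] for w, i in zip(weights, keep))
              for d in range(dim)]
    return keep, output

query = [1.0, 0.2]
keys = [[0.9, 0.1], [-1.0, 0.3], [1.1, -0.2], [0.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
kept, attended = pruned_attention(query, keys, values, threshold=0.5)
```

Only the rows listed in `kept` would need to be fetched for the exact pass, which is the data-movement saving the abstract refers to.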

preprint2022arXiv

Towards Reproducible Evaluations for Flying Drone Controllers in Virtual Environments

Research attention on natural user interfaces (NUIs) for drone flights is rising. Nevertheless, NUIs are highly diversified and are primarily evaluated in different physical environments, which makes the performance of such solutions hard to compare. To address this issue, we propose a virtual environment, namely VRFlightSim, that enables comparative evaluations with enriched drone flight details. We first replicated a state-of-the-art (SOTA) interface and designed two tasks (crossing and pointing) in our virtual environment. Then, two user studies with 13 participants demonstrate the necessity of VRFlightSim and further highlight the potential of open-data interface designs.

preprint2021arXiv

Design and Control of a Highly Redundant Rigid-Flexible Coupling Robot to Assist the COVID-19 Oropharyngeal-Swab Sampling

The outbreak of novel coronavirus pneumonia (COVID-19) has caused mortality and morbidity worldwide. Oropharyngeal-swab (OP-swab) sampling is widely used around the world for the diagnosis of COVID-19. To protect clinical staff from infection, we developed a 9-degree-of-freedom (DOF) rigid-flexible coupling (RFC) robot to assist COVID-19 OP-swab sampling. This robot is composed of a visual system, a UR5 robot arm, a micro-pneumatic actuator, and a force-sensing system. The robot is expected to reduce risk and free clinical staff from long-term repetitive sampling work. Compared with a rigid sampling robot, the developed force-sensing RFC robot can facilitate OP-swab sampling procedures in a safer and softer way. In addition, a varying-parameter zeroing neural network-based optimization method is proposed for motion planning of the 9-DOF redundant manipulator. The developed robot system is validated by OP-swab sampling on both oral cavity phantoms and volunteers.

preprint2021arXiv

Long Live The Image: Container-Native Data Persistence in Production

Containerization plays a crucial role in the de facto technology stack for implementing microservices architecture (each microservice has its own database in most cases). Nevertheless, there are still fierce debates on containerizing production databases, mainly due to the data persistence issues and concerns. Driven by a project of refactoring an Automated Machine Learning system, this research proposes the container-native data persistence as a conditional solution to running database containers in production. In essence, the proposed solution distinguishes the stateless data access (i.e. reading) from the stateful data processing (i.e. creating, updating, and deleting) in databases. A master database handles the stateful data processing and dumps database copies for building container images, while the database containers will keep stateless at runtime, based on the preloaded dump in the image. Although there are delays in the state/image update propagation, this solution is particularly suitable for the read-only, the eventual consistency, and the asynchronous processing scenarios. Moreover, with optimal tuning (e.g., disabling locking), the portability and performance gains of a read-only database container would outweigh the performance loss in accessing data across the underlying image layers.

preprint2021arXiv

On a Factorial Knowledge Architecture for Data Science-powered Software Engineering

Given the data-intensive and collaborative trend in science, the software engineering community also pays increasing attention to obtaining valuable and useful insights from data repositories. Nevertheless, applying data science to software engineering (e.g., mining software repositories) can be blind and meaningless if it lacks a suitable knowledge architecture (KA). By observing that software engineering practices are generally recorded through a set of factors (e.g., programmer capacity, different environmental conditions, etc.) involved in various software project aspects, we propose a factor-based hierarchical KA of software engineering to help maximize the value of software repositories and inspire future data-driven software studies. In particular, it is the organized factors and their relationships that help guide software engineering knowledge mining, while the mined knowledge will in turn be indexed/managed through the relevant factors and their interactions. This paper explains our idea about the factorial KA and concisely demonstrates a KA component, i.e. the early-version KA of software product engineering. Once fully scoped, this proposed KA will supplement the well-known SWEBOK in terms of both the factor-centric knowledge management and the coverage/implication of potential software engineering knowledge.

preprint2021arXiv

Reconstruction of Quantitative Susceptibility Maps from Phase of Susceptibility Weighted Imaging with Cross-Connected $Ψ$-Net

Quantitative Susceptibility Mapping (QSM) is a new phase-based technique for quantifying magnetic susceptibility. The existing QSM reconstruction methods generally require complicated pre-processing on high-quality phase data. In this work, we propose to explore a new value of the high-pass filtered phase data generated in susceptibility weighted imaging (SWI), and develop an end-to-end Cross-connected $Ψ$-Net (C$Ψ$-Net) to reconstruct QSM directly from these phase data in SWI without additional pre-processing. C$Ψ$-Net adds an intermediate branch in the classical U-Net to form a $Ψ$-like structure. The specially designed dilated interaction block is embedded in each level of this branch to enlarge the receptive fields for capturing more susceptibility information from a wider spatial range of phase images. Moreover, the crossed connections are utilized between branches to implement a multi-resolution feature fusion scheme, which helps C$Ψ$-Net capture rich contextual information for accurate reconstruction. The experimental results on a human dataset show that C$Ψ$-Net achieves superior performance in our task over other QSM reconstruction algorithms.

preprint2021arXiv

Selective quantum Zeno effect of ultracold atom-molecule scattering in dynamic magnetic fields

We demonstrated that final states of ultracold scattering between atom and molecule can be selectively produced using dynamic magnetic fields of multiple frequencies. The mechanism of the dynamic magnetic field control is based on a generalized quantum Zeno effect for the selected scattering channels. In particular, we use an atom-molecule spin flip scattering to show that the transition to the selected final spin projection of the molecule in the inelastic scattering can be suppressed by dynamic modulation of coupling between the Floquet engineered initial and final states.

preprint2021arXiv

Stop Building Castles on a Swamp! The Crisis of Reproducing Automatic Search in Evidence-based Software Engineering

The evidence-based approach has increasingly been employed to synthesize empirical findings from the primary research in software engineering. Nevertheless, the reproducibility of evidence-based software engineering (EBSE) studies seems to be underemphasized. In our investigation into the automatic search of 311 sample studies, more than 50% of the search strings are not reusable; about 87.5% of the search activities (e.g., search field settings) are unrepeatable; and more than 95% of the whole automatic search implementations are unreproducible. Considering that searching is a cornerstone of an EBSE study, we are afraid that the reproducibility of the current secondary research could be worse than we can imagine. By analyzing and reporting the root causes of the aforementioned observations, we urge collaboration and cooperation among all the stakeholders in our community to improve the research reproducibility in EBSE.

preprint2021arXiv

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples, with numerous high-stakes applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How can we accelerate training and the scoring of newly arriving samples by outlyingness (referred to as prediction throughout the paper) with a large number of unsupervised, heterogeneous OD models? In this study, we propose a modular acceleration system, called SUOD, to address this question. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environments), while maintaining performance accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, along with a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm. We open-source SUOD for reproducibility and accessibility.

preprint2021arXiv

Superresolving second-order correlation imaging using synthesized colored noise speckles

We present a novel method to synthesize non-trivial speckles that can enable superresolving second-order correlation imaging. The speckles acquire a unique anti-correlation in the spatial intensity fluctuation by introducing the blue noise spectrum to the input light fields through amplitude modulation. Illuminating objects with the blue noise speckle patterns can lead to a sub-diffraction limit imaging system with a resolution more than three times higher than first-order imaging, which is comparable to the resolving power of ninth order correlation imaging with thermal light. Our method opens a new route towards non-trivial speckle generation by tailoring amplitudes of the input light fields and provides a versatile scheme for constructing superresolving imaging and microscopy systems without invoking complicated higher-order correlations.

preprint2021arXiv

Tri-Hexagonal charge order in kagome metal CsV$_{3}$Sb$_{5}$ revealed by $^{121}$Sb NQR

We report $^{121}$Sb nuclear quadrupole resonance (NQR) measurements on kagome superconductor CsV$_3$Sb$_5$ with $T_{\rm c}=2.5$ K. $^{121}$Sb NQR spectra split after a charge density wave (CDW) transition at $94$ K, which demonstrates a commensurate CDW state. The coexistence of the high temperature phase and the CDW phase between $91$ K and $94$ K manifests that it is a first order phase transition. The CDW order exhibits Tri-Hexagonal deformation with a lateral shift between the adjacent kagome layers, which is consistent with $2 \times 2 \times 2$ superlattice modulation. The superconducting state coexists with CDW order and shows a conventional s-wave behavior in the bulk state.

preprint2020arXiv

A Dynamic Subspace Based BFGS Method for Large Scale Optimization Problem

Large-scale unconstrained optimization is a fundamental and important, yet not well-solved, class of problems in numerical optimization. The main challenge in designing an algorithm is to require only a few storage locations or very inexpensive computations while preserving global convergence. In this work, we propose a novel approach to solving large-scale unconstrained optimization problems by combining the dynamic subspace technique and the BFGS update algorithm. It is clearly demonstrated that our approach has the same rate of convergence in the dynamic subspace as BFGS and requires less memory than L-BFGS. Further, we give the convergence analysis by constructing the mapping from the low-dimensional Euclidean space to the adaptive subspace. We compare our hybrid algorithm with the BFGS and L-BFGS approaches. Experimental results show that our hybrid algorithm offers several significant advantages such as parallel computing, convergence efficiency, and robustness.

preprint2020arXiv

COPOD: Copula-Based Outlier Detection

Outlier detection refers to the identification of rare items that deviate from the general data distribution. Existing approaches suffer from high computational complexity, low predictive capability, and limited interpretability. As a remedy, we present a novel outlier detection algorithm called COPOD, which is inspired by copulas for modeling multivariate data distribution. COPOD first constructs an empirical copula, and then uses it to predict tail probabilities of each given data point to determine its level of "extremeness". Intuitively, we think of this as calculating an anomalous p-value. This makes COPOD parameter-free, highly interpretable, and computationally efficient. In this work, we make three key contributions: 1) propose a novel, parameter-free outlier detection algorithm with both great performance and interpretability, 2) perform extensive experiments on 30 benchmark datasets to show that COPOD outperforms in most cases and is also one of the fastest algorithms, and 3) release an easy-to-use Python implementation for reproducibility.
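
The tail-probability idea in the abstract above can be illustrated with a deliberately simplified sketch. It is not the released COPOD implementation: it only combines per-dimension left-tail and right-tail empirical CDF values through negative logarithms and takes the larger total, and any further corrections the published method applies are omitted. Function names and the toy data are hypothetical.

```python
import math

def ecdf(column, x):
    """Fraction of observations in column that are <= x."""
    return sum(1 for v in column if v <= x) / len(column)

def tail_scores(data):
    """Outlier score per row; larger means more extreme in some tail."""
    n_dims = len(data[0])
    columns = [[row[d] for row in data] for d in range(n_dims)]
    # Negating a column turns its right tail into a left tail.
    negated = [[-v for v in col] for col in columns]
    scores = []
    for row in data:
        # Each row is part of the data, so every ECDF value is at least 1/n
        # and the logarithm is always defined.
        left = sum(-math.log(ecdf(columns[d], row[d])) for d in range(n_dims))
        right = sum(-math.log(ecdf(negated[d], -row[d])) for d in range(n_dims))
        scores.append(max(left, right))
    return scores

# Four ordinary points and one point that is extreme in both dimensions.
data = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0], [5.0, 9.0]]
scores = tail_scores(data)
most_extreme = scores.index(max(scores))
```

No hyperparameter appears anywhere in the scoring, which is the sense in which the abstract calls the approach parameter-free; the per-dimension terms also show which feature drives a high score.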

preprint2020arXiv

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Video retrieval is a challenging research topic bridging the vision and language areas and has attracted broad attention in recent years. Previous works have been devoted to representing videos by directly encoding from frame-level features. In fact, videos consist of various and abundant semantic relations to which existing methods pay less attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions. Specifically, we consider frame regions as vertices and construct a fully-connected semantic correlation graph. Then, we perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations. With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed. Finally, the region features are aggregated to form frame-level features for further encoding to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method by achieving state-of-the-art performance due to the powerful semantic reasoning.

preprint2020arXiv

Photoinduced Vibrations Drive Ultrafast Structural Distortion in Lead Halide Perovskite

Organic-inorganic perovskites have shown great promise towards their application in optoelectronics. The success of this class of material is dictated by the complex interplay between various underlying microscopic phenomena. The structural dynamics of organic cations and the inorganic sublattice after photoexcitation is hypothesized to have a direct effect on the material properties, thereby affecting the overall device performance. Here, we use two-dimensional (2D) electronic spectroscopy to reveal impulsively excited vibrational modes of methylammonium (MA) lead iodide perovskite, which drive the structural distortion after photoexcitation. The vibrational analysis of the measured data allows us to directly monitor the time evolution of the librational motion of the MA cation along with the vibrational coherences of inorganic sublattice. Wavelet analysis of the observed vibrational coherences uncovers the interplay between these two types of phonons. It reveals the coherent generation of the librational motion of the MA cation within ~300 fs, which is complemented by the coherent evolution of the skeletal motion of the inorganic sublattice. We have employed time-dependent density functional theory (TDDFT) to study the atomic motion of the MA cation and the inorganic sublattice during the process of photoexcitation. The TDDFT calculations support our experimental observations of the coherent generation of librational motions in the MA cation and highlight the importance of the anharmonic interaction between the MA cation and the inorganic sublattice. Our calculations predict the transfer of the photoinduced vibrational coherence from the MA cation to the inorganic sublattice, which drives the skeleton motion to form a polaronic state leading to long lifetimes of the charge carriers. This work may lead to novel design principles for next generation of solar cell materials.

preprint2020arXiv

Rate Splitting for Multi-Antenna Downlink: Precoder Design and Practical Implementation

Rate splitting (RS) is a potentially powerful and flexible technique for multi-antenna downlink transmission. In this paper, we address several technical challenges towards its practical implementation for beyond 5G systems. To this end, we focus on a single-cell system with a multi-antenna base station (BS) and K single-antenna receivers. We consider RS in its most general form, and joint decoding to fully exploit the potential of RS. First, we investigate the achievable rates under joint decoding and formulate the precoder design problems to maximize a general utility function, or to minimize the transmit power under pre-defined rate targets. Building upon the concave-convex procedure (CCCP), we propose precoder design algorithms for an arbitrary number of users. Our proposed algorithms approximate the intractable non-convex problems with a number of successively refined convex problems, and provably converge to stationary points of the original problems. Then, to reduce the decoding complexity, we consider the optimization of the precoder and the decoding order under successive decoding. Further, we propose a stream selection algorithm to reduce the number of precoded signals. With a reduced number of streams and successive decoding at the receivers, our proposed algorithm can even be implemented when the number of users is relatively large, whereas the complexity was previously considered as prohibitively high in the same setting. Finally, we propose a simple adaptation of our algorithms to account for the imperfection of the channel state information at the transmitter. Numerical results demonstrate that the general RS scheme provides a substantial performance gain as compared to state-of-the-art linear precoding schemes, especially with a moderately large number of users.

preprint2020arXiv

Research on Annotation Rules and Recognition Algorithm Based on Phrase Window

At present, most Natural Language Processing technology is based on the results of Word Segmentation for Dependency Parsing, which mainly uses an end-to-end method based on supervised learning. There are two main problems with this method: first, the labeling rules are complex and the data is difficult to label, so the workload is large; second, the algorithm cannot recognize the multi-granularity and diversity of language components. To solve these two problems, we propose labeling rules based on phrase windows and design corresponding phrase recognition algorithms. The labeling rule uses phrases as the minimum unit, divides sentences into 7 nestable phrase types, and marks the grammatical dependencies between phrases. The corresponding algorithm, drawing on the idea of identifying the target area in the image field, can find the start and end positions of various phrases in the sentence, and realizes the synchronous recognition of nested phrases and grammatical dependencies. The experimental results show that the labeling rule is convenient, easy to use, and unambiguous; the algorithm better captures grammatical multi-granularity and diversity than the end-to-end algorithm. Experiments on the CPWD dataset improve the accuracy of the end-to-end method by about 1 point. The corresponding method was applied to the CCL2018 competition and won first place in the Chinese Metaphor Sentiment Analysis Task.

preprint2020arXiv

Research on multi-dimensional end-to-end phrase recognition algorithm based on background knowledge

At present, the deep end-to-end method based on supervised learning is used in entity recognition and dependency analysis. There are two problems with this method: first, background knowledge cannot be introduced; second, the multi-granularity and nested features of natural language cannot be recognized. To solve these problems, annotation rules based on the phrase window are proposed, and a corresponding multi-dimensional end-to-end phrase recognition algorithm is designed. The annotation rule divides sentences into seven types of nested phrases and indicates the dependency between phrases. The algorithm can not only introduce background knowledge and recognize all kinds of nested phrases in sentences, but also recognize the dependency between phrases. The experimental results show that the annotation rule is easy to use and unambiguous; the matching algorithm is more consistent with the multi-granularity and diversity characteristics of syntax than the traditional end-to-end algorithm. In experiments on the CPWD dataset, by introducing background knowledge, the new algorithm improves the accuracy of the end-to-end method by more than one point. The corresponding method was applied to the CCL 2018 competition and won first place in the task of Chinese humor type recognition.

preprint2020arXiv

SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

A synthetic dataset is a data object that is generated programmatically, and it can be valuable for creating a single dataset from multiple sources when direct collection is difficult or costly. Although synthetic data generation is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this paper, we study a specific synthetic data generation task called downscaling, a procedure to infer high-resolution, harder-to-collect information (e.g., individual level records) from many low-resolution, easy-to-collect sources, and propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). For given low-resolution datasets, the central idea of SYNC is to fit Gaussian copula models to each of the low-resolution datasets in order to correctly capture dependencies and marginal distributions, and then sample from the fitted models to obtain the desired high-resolution subsets. Predictive models are then used to merge sampled subsets into one, and finally, sampled datasets are scaled according to low-resolution marginal constraints. We make four key contributions in this work: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) perform simulation studies to validate SYNC's performance as a synthetic data generation algorithm, 3) demonstrate, through two real-world datasets, its value as a feature engineering tool and as an alternative to data collection in situations where gathering is difficult, 4) release an easy-to-use framework implementation for reproducibility and scalability at the production level that easily incorporates new data.
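The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the SYNC implementation: it only draws from a bivariate Gaussian copula with a correlation `rho` that is assumed to be already fitted, and maps the draws back through empirical marginals. The function name, its parameters, and the example data are hypothetical.

```python
import random
from statistics import NormalDist


def sample_gaussian_copula(marginal_a, marginal_b, rho, n, seed=0):
    """Draw n pairs whose marginals follow the two observed samples and
    whose dependence follows a Gaussian copula with correlation rho.

    Assumption: rho has already been estimated elsewhere; this sketch
    covers only the sampling step, not the fitting step.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    sorted_a, sorted_b = sorted(marginal_a), sorted(marginal_b)

    def empirical_quantile(sorted_values, u):
        # Map a uniform value u in (0, 1) to an observed value.
        index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
        return sorted_values[index]

    pairs = []
    for _ in range(n):
        # Correlated standard normals via a 2x2 Cholesky factor.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # The normal CDF turns them into dependent uniforms (the copula).
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        pairs.append((empirical_quantile(sorted_a, u1),
                      empirical_quantile(sorted_b, u2)))
    return pairs
```

With `rho=1.0` the two coordinates share the same rank, which gives a quick sanity check; intermediate values of `rho` weaken the dependence while leaving each marginal unchanged.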

preprint2016arXiv

BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-Consensus

Genome assembly from high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by the widespread repetitive elements. Here we get around the uncertainty using the notion of uniquely mapped (UM) reads, which motivated the design of a new assembler, BAUM. It mainly consists of two types of iterations. The first type of iteration constructs initial contigs from a reference, say a genome of a species that could be quite distant, by adaptive read mapping, filtration by the reference's unique regions, and reference updating. A statistical test is proposed to split the layouts at possible structural variation sites. The second type of iteration includes mapping, scaffolding/contig-extension, and contig merging. We extend each contig by locally assembling the reads whose mates are uniquely mapped to an end of the contig. Instead of the de Bruijn graph method, we adopt the overlap-layout-consensus (OLC) paradigm. The OLC is implemented by parallel computation, and has linear complexity with respect to the number of contigs. Adjacent extended contigs are merged if their alignment is confirmed by the adjusted gap distance. Throughout the assembly, the mapping criterion is selected by probabilistic calculations. These innovations can be used as a complement to the existing de novo assemblers. Applying this novel method to the assembly of the wild rice Oryza longistaminata genome, we achieved a much improved contig N50 of 18.8k, compared with other assemblers. The assembly was further validated by contigs constructed from an independent library of long 454 reads.
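The contig N50 quoted above is a standard assembly statistic: the largest length L such that contigs of length at least L together cover at least half of the assembled bases. A minimal sketch of that definition follows; the function name and the example lengths are illustrative and are not taken from BAUM.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths: total is 54, and 10 + 9 + 8 = 27 reaches half.
example_n50 = contig_n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends on the total assembled length, values are only directly comparable between assemblies of similar total size.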

preprint2016arXiv

Double MRT Thermal Lattice Boltzmann Method for Simulating Natural Convection of Low Prandtl Number Fluids

The purposes of this paper are to test an efficient algorithm based on LBM and to use it to analyze two-dimensional natural convection at low Prandtl number. Steady-state or oscillatory results are obtained using the double multiple-relaxation-time thermal lattice Boltzmann method. The velocity and temperature fields are solved using D2Q9 and D2Q5 models, respectively. Depending on the Rayleigh number, the tested natural convection either reaches a steady state or becomes oscillatory. At a fixed Rayleigh number, a lower Prandtl number leads to a weaker convection effect, a longer oscillation period and a higher oscillation amplitude for the cases reaching oscillatory solutions. At a fixed Prandtl number, a higher Rayleigh number leads to a more notable convection effect and a longer oscillation period. The double multiple-relaxation-time thermal lattice Boltzmann method is applied to simulate natural convection of low Prandtl number fluids. Rayleigh number and Prandtl number effects are also investigated when the natural convection results oscillate.

preprint2016arXiv

Effects of Slotted Structures on Nonlinear Characteristics of Natural Convection in a Cylinder with an Internal Concentric Slotted Annulus

Natural convection in a cylinder with an internally slotted annulus was solved with the SIMPLE algorithm, and the effects of different slotted structures on the nonlinear characteristics of natural convection were investigated. The results show that the equivalent thermal conductivity Keq increases with Rayleigh number and reaches its maximum in the vertical orientation. Nonlinear results were obtained by simulating the fluid flow under different conditions. With increasing Rayleigh number, heat transfer is intensified and the state of heat transfer changes from steady to unsteady. We investigated the effects of different slotted structures on natural convection and analyzed the corresponding nonlinear characteristics.

preprint2016arXiv

Interface induced high temperature superconductivity in single unit-cell FeSe films on SrTiO3(110)

We report high temperature superconductivity in one unit-cell (1-UC) FeSe films grown on STO(110) substrate by molecular beam epitaxy. By in-situ scanning tunneling spectroscopy measurement, we observed a superconducting gap as large as 17 meV. Transport measurements on 1-UC FeSe/STO(110) capped with FeTe layers reveal superconductivity with an onset TC of 31.6 K and an upper critical magnetic field of 30.2 T. We also find that the TC can be further increased by an external electric field, but the effect is smaller than that on STO(001) substrate. The study points out the important roles of interface related charge transfer and electron-phonon coupling in the high temperature superconductivity of FeSe/STO.

preprint2016arXiv

Lattice Boltzmann Method Simulation of 3-D Melting Using Double MRT Model with Interfacial Tracking Method

Three-dimensional melting problems are investigated numerically with the lattice Boltzmann method (LBM). For accuracy and stability of the algorithm, multiple-relaxation-time (MRT) models are employed to simplify the collision term in LBM. Temperature and velocity fields are each solved with their own distribution function. 3-D melting problems are solved with double MRT models for the first time in this article. The key point in the numerical simulation of a melting problem is the method used to locate the melting front, and this article uses the interfacial tracking method. The interfacial tracking method combines the advantages of both deforming and fixed grid approaches. The location of the melting front is obtained by calculating the energy balance at the solid-liquid interface. Various 3-D conduction-controlled melting problems are solved first to verify the numerical method. The liquid fraction trend and temperature distribution obtained numerically agree well with the analytical results. The proposed double MRT model with the interfacial tracking method is valid for solving 3-D melting problems. Different 3-D convection-controlled melting problems are then solved with the proposed numerical method. The melting front moves at different velocities for different locations of the heated surface, due to natural convection effects. The effects of Rayleigh number on the 3-D melting process are discussed.

preprint2015arXiv

Atomically Resolved FeSe/SrTiO3(001) Interface Structure by Scanning Transmission Electron Microscopy

Interface-enhanced high-temperature superconductivity in one unit-cell (UC) FeSe films on SrTiO3(001) (STO) substrate has recently attracted much attention in condensed matter physics and material science. By combined in-situ scanning tunneling microscopy/spectroscopy (STM/STS) and ex-situ scanning transmission electron microscopy (STEM) studies, we report on atomically resolved structure including both lattice constants and actual atomic positions of the FeSe/STO interface under both non-superconducting and superconducting states. We observed TiO2 double layers (DLs) and significant atomic displacements in the top two layers of STO, lattice compression of the Se-Fe-Se triple layer, and relative shift between bottom Se and topmost Ti atoms. By imaging the interface structures under various superconducting states, we unveil a close correlation between interface structure and superconductivity. Our atomic-scale identification of FeSe/STO interface structure provides useful information on investigating the pairing mechanism of this interface-enhanced high-temperature superconducting system.

preprint2015arXiv

Interface enhanced electron-phonon coupling and high temperature superconductivity in potassium-coated ultra-thin FeSe films on SrTiO3

Alkali-metal (potassium) adsorption on FeSe thin films with thickness from two unit cells (2-UC) to 4-UC on SrTiO3 grown by molecular beam epitaxy is investigated with a low-temperature scanning tunneling microscope. At appropriate potassium coverage (0.2-0.3 monolayer), the tunneling spectra of the films all exhibit a superconducting-like gap larger than 11 meV (five times the gap value of bulk FeSe), and two distinct features of characteristic phonon modes at 11 meV and 21 meV. The results reveal the critical role of the interface-enhanced electron-phonon coupling for possible high-temperature superconductivity in the system and are consistent with recent theories. Our study provides compelling evidence for the conventional pairing mechanism for this type of heterostructure superconducting system.

preprint2015arXiv

Lattice Boltzmann Method Simulation of 3-D Natural Convection with Double MRT Model

The multiple-relaxation-time (MRT) model has more advantages than many other approaches in the lattice Boltzmann method (LBM). A three-dimensional double MRT model is proposed for the first time for fluid flow and heat transfer simulation. Three types of cubic natural convection problems are solved with the proposed method at various Rayleigh numbers. Two opposite vertical walls on the left and right are kept at different temperatures for all three types, while the remaining four walls are either adiabatic or have linear temperature variations. For the first two types of cubic natural convection, in which the four walls are either all adiabatic or all vary linearly, the present results agree very well with the benchmark solutions or experimental results in the literature. For the third type of cubic natural convection, the front and back surfaces have linearly varying temperatures while the bottom and top surfaces are adiabatic. The results from the third type exhibit more general three-dimensional characteristics.

preprint2015arXiv

Superconductivity dichotomy in K-coated single and double unit cell FeSe films on SrTiO3

We report the superconductivity evolution of one unit cell (1-UC) and 2-UC FeSe films on SrTiO3(001) substrates with potassium (K) adsorption. By in situ scanning tunneling spectroscopy measurement, we find that the superconductivity in 1-UC FeSe films is continuously suppressed with increasing K coverage, whereas non-superconducting 2-UC FeSe films become superconducting with a gap of ~17 meV or ~11 meV depending on whether the underlying 1-UC films are superconducting or not. This work explicitly reveals that the interface electron-phonon coupling is strongly related to the charge transfer at the FeSe/STO interface and plays a vital role in enhancing Cooper pairing in both 1-UC and 2-UC FeSe films.

preprint2015arXiv

The more Product Complexity, the more Actual Effort? An Empirical Investigation into Software Developments

[Background:] Software effort prediction methods and models typically assume a positive correlation between software product complexity and development effort. However, conflicting observations, i.e. a negative correlation between product complexity and actual effort, have been witnessed in our experience with the COCOMO81 dataset. [Aim:] Given our doubt about whether the observed phenomenon is a coincidence, this study investigates whether an increase in product complexity can result in the abovementioned counter-intuitive trend in software development projects. [Method:] A modified association rule mining approach is applied to the transformed COCOMO81 dataset. To reduce noise in the analysis, this approach uses a constant antecedent (Complexity increases while Effort decreases) to mine potential consequents with pruning. [Results:] The experiment mined four, five, and seven association rules from the general, embedded, and organic projects data, respectively. The consequents of the mined rules point to two main aspects of particular concern in this study, namely human capability and product scale. [Conclusions:] The negative correlation between complexity and effort is not a coincidence under particular conditions. In a software project, interactions between product complexity and other factors, such as Programmer Capability and Analyst Capability, can inevitably play a "friction" role in weakening the practical influences of product complexity on actual development effort.

preprint2015arXiv

Thermal and nonthermal melting of silicon under femtosecond x-ray irradiation

As is known from visible light experiments, silicon under femtosecond pulse irradiation can undergo so-called 'nonthermal melting' if the density of electrons excited from the valence to the conduction band exceeds a certain critical value. Such an ultrafast transition is induced by strong changes in the atomic potential energy surface, which trigger atomic relocation. However, heating of a material due to electron-phonon coupling can also lead to a phase transition, called 'thermal melting'. This thermal melting can occur even if the excited-electron density is much too low to induce nonthermal effects. To study phase transitions, and in particular the interplay of thermal and nonthermal effects in silicon under femtosecond x-ray irradiation, we propose their unified treatment by going beyond the Born-Oppenheimer approximation within our hybrid model based on tight binding molecular dynamics. With our extended model we identify damage thresholds for various phase transitions in irradiated silicon. We show that electron-phonon coupling triggers the phase transition of solid silicon into a low-density liquid phase if the energy deposited into the sample is above $\sim0.65$ eV per atom. For deposited doses of over $\sim0.9$ eV per atom, solid silicon undergoes a phase transition into a high-density liquid phase triggered by an interplay between electron-phonon heating and nonthermal effects. These thresholds are much lower than those predicted with the Born-Oppenheimer approximation ($\sim2.1$ eV/atom), and indicate a significant contribution of electron-phonon coupling to the relaxation of the laser-excited silicon. We expect that these results will stimulate dedicated experimental studies, unveiling in detail various paths of structural relaxation within laser-irradiated silicon.

preprint2014arXiv

A model combining spectrum standardization and dominant factor based partial least square method for carbon analysis in coal by laser-induced breakdown spectroscopy

Quantitative measurement of carbon content in coal using laser-induced breakdown spectroscopy (LIBS) suffers from relatively low precision and accuracy. In the present work, the spectrum standardization method was combined with the dominant factor based partial least square (PLS) method to improve the measurement accuracy of carbon content in coal by LIBS. The combination model employed the spectrum standardization method to convert the carbon line intensity into a standard state for more accurate calculation of the dominant carbon concentration, and then applied PLS with full spectrum information to correct the residual errors. The combination model was applied to the measurement of carbon content for 24 bituminous coal samples. The results demonstrated that the combination model could further improve the measurement accuracy compared with both our previously established spectrum standardization model and the dominant factor based PLS model using spectral area normalized intensity for the dominant factor model. For example, the coefficient of determination (R²), the root-mean-square error of prediction (RMSEP), and the average relative error (ARE) for the combination model were 0.99, 1.75%, and 2.39%, respectively; those values for the spectrum standardization method were 0.83, 2.71%, and 3.40%, respectively; and those values for the dominant factor based PLS model were 0.99, 2.66%, and 3.64%, respectively.
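The three figures of merit reported above can be computed as in the following minimal sketch. It assumes the common textbook definitions; the paper may define them slightly differently (for example, R² taken from the calibration fit). The function names and the sample values are hypothetical.

```python
import math


def coefficient_of_determination(measured, predicted):
    """R^2 = 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmsep(measured, predicted):
    """Root-mean-square error of prediction, in the units of the data."""
    squared = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared / len(measured))


def average_relative_error(measured, predicted):
    """Mean of |predicted - measured| / measured, as a fraction."""
    ratios = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical carbon contents in percent, not data from the paper.
reference = [50.0, 60.0, 70.0]
estimated = [51.0, 59.0, 70.0]
r2 = coefficient_of_determination(reference, estimated)
```

R² measures how well predictions track the spread of the reference values, whereas RMSEP and ARE measure absolute and relative error, so two models can share an R² of 0.99 and still differ in RMSEP, as in the abstract.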

preprint2014arXiv

Dynamics of fluctuations in a quantum system

"The noise is the signal" [R. Landauer, Nature 392, 658 (1998)] emphasizes the rich information content encoded in fluctuations. This paper assesses the dynamical role of fluctuations of a quantum system driven far from equilibrium, with laser-aligned molecules as a physical realization. Time evolutions of the expectation value and the uncertainty of a standard observable are computed quantum mechanically and classically. We demonstrate the intricate dynamics of the uncertainty that are strikingly independent of those of the expectation value, and their exceptional sensitivity to quantum properties of the system. In general, detecting the time evolution of the fluctuations of a given observable provides information on the dynamics of correlations in a quantum system.

preprint2014arXiv

HERMES: A Hierarchical Broadcast-Based Silicon Photonic Interconnect for Scalable Many-Core Systems

Optical interconnection networks, as enabled by recent advances in silicon photonic device and fabrication technology, have the potential to address on-chip and off-chip communication bottlenecks in many-core systems. Although several designs have shown superior power efficiency and performance compared to electrical alternatives, these networks will not scale to the thousands of cores required in the future. In this paper, we introduce Hermes, a hybrid network composed of an optimized broadcast for power-efficient low-latency global-scale coordination and circuit-switched sub-networks for high-throughput data delivery. This network will scale for use in thousand-core chip systems. At the physical level, an SoI-based adiabatic coupler has been designed to provide low-loss and compact optical power splitting. Based on the adiabatic coupler, a topology based on a 2-ary folded butterfly is designed to provide linear power division in a thousand-core layout with minimal cross-overs. To address network agility and provide for efficient use of optical bandwidth, a flow control and routing mechanism is introduced to dynamically allocate bandwidth and provide fair usage of network resources. At the system level, Bloom filter-based filtering for localization of communication is designed to reduce global traffic. In addition, novel greedy data and workload migration is leveraged to increase the locality of communication in a NUCA (non-uniform cache access) architecture. First-order analytic evaluation results indicate that Hermes is scalable to at least 1024 cores and offers significant performance improvement and power savings over prior silicon photonic designs.
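The abstract does not describe how Hermes builds or queries its Bloom filters, so the following is only a generic sketch of the data structure it names: a compact set summary that can answer "definitely absent" or "possibly present". The class, its parameters, and the cache-line key format are hypothetical.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical use: summarise which cache lines a local cluster holds.
local_lines = BloomFilter()
local_lines.add("line:0x1f40")
```

A negative answer is always correct, so a lookup that misses the filter never needs to be tried locally; a positive answer may occasionally be wrong and must be confirmed by the real lookup.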

preprint2014arXiv

Multiconfiguration time-dependent Hartree impurity solver for nonequilibrium dynamical mean-field theory

Nonequilibrium dynamical mean-field theory (DMFT) solves correlated lattice models by obtaining their local correlation functions from an effective model consisting of a single impurity in a self-consistently determined bath. The recently developed mapping of this impurity problem from the Keldysh time contour onto a time-dependent single-impurity Anderson model (SIAM) [C. Gramsch et al., Phys. Rev. B 88, 235106 (2013)] allows one to use wave function-based methods in the context of nonequilibrium DMFT. Within this mapping, long times in the DMFT simulation become accessible by an increasing number of bath orbitals, which requires efficient representations of the time-dependent SIAM wave function. These can be achieved by the multiconfiguration time-dependent Hartree (MCTDH) method and its multi-layer extensions. We find that MCTDH outperforms exact diagonalization for large baths in which the latter approach is still within reach and allows for the calculation of SIAMs beyond the system size accessible by exact diagonalization. Moreover, we illustrate the computation of the self-consistent two-time impurity Green's function within the MCTDH second quantization representation.

preprint2014arXiv

Noise Equivalent Counts Based Emission Image Reconstruction Algorithm of Tomographic Gamma Scanning

Tomographic Gamma Scanning (TGS) is a technique used to assay the nuclide distribution and radioactivity in nuclear waste drums. Both transmission and emission scans are performed in TGS, and the transmission image is used for the attenuation correction in emission reconstructions. The error of the transmission image, which is not considered by the existing reconstruction algorithms, negatively affects the final results. An emission reconstruction method based on Noise Equivalent Counts (NEC) is presented. Noise from the attenuation image is concentrated into the projection data so that the NEC Maximum-Likelihood Expectation-Maximization algorithm can be applied. Experiments are performed to verify the effectiveness of the proposed method.
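For orientation, the sketch below shows the standard Maximum-Likelihood Expectation-Maximization (MLEM) update that the proposed method builds on. It does not include the NEC scaling described in the abstract, and the system matrix here is a toy example rather than an attenuation-corrected TGS model. All names and values are hypothetical.

```python
def mlem(system_matrix, counts, iterations=50):
    """Standard MLEM for emission reconstruction.

    system_matrix[i][j] is the assumed probability that an emission in
    voxel j is detected in projection bin i; counts[i] is the measured
    count in bin i.
    """
    n_bins = len(system_matrix)
    n_voxels = len(system_matrix[0])
    # Sensitivity of each voxel: total detection probability over bins.
    sensitivity = [sum(system_matrix[i][j] for i in range(n_bins))
                   for j in range(n_voxels)]
    estimate = [1.0] * n_voxels
    for _ in range(iterations):
        # Forward-project the current estimate.
        expected = [sum(system_matrix[i][j] * estimate[j]
                        for j in range(n_voxels))
                    for i in range(n_bins)]
        ratios = [counts[i] / expected[i] if expected[i] > 0 else 0.0
                  for i in range(n_bins)]
        # Back-project the measured/expected ratios and rescale.
        estimate = [
            estimate[j] / sensitivity[j]
            * sum(system_matrix[i][j] * ratios[i] for i in range(n_bins))
            if sensitivity[j] > 0 else 0.0
            for j in range(n_voxels)
        ]
    return estimate


# Toy two-voxel, two-bin system with some cross-talk between bins.
activity = mlem([[0.8, 0.2], [0.2, 0.8]], [5.0, 5.0], iterations=10)
```

The multiplicative update keeps the estimate non-negative and, when every voxel has unit sensitivity as in the toy system, preserves the total number of measured counts.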

preprint2014arXiv

Studies of LL-type 500MHz 5-cell superconducting cavity at SINAP

A low loss (LL) type 500 MHz 5-cell superconducting niobium prototype cavity with large beam aperture has been developed successfully, including optimization, deep drawing and electron beam welding, surface treatment and vertical testing. The performance of the fundamental mode was optimized, and the higher order modes were damped by adopting an enlarged beam pipe for propagation. Surface treatment, including mechanical polishing, buffered chemical polishing and high-pressure rinsing with ultra-pure water, was carried out carefully to ensure a perfect inner surface condition. The vertical test results show that an accelerating voltage higher than 7.5 MV was obtained with a quality factor better than 1E9 at 4.2 K. No obvious multipacting or field emission was found during the test. However, a quench occurred when the field was raised slightly above 7.5 MV, which at present limits the cavity performance.

preprint2014arXiv

The application of spectrum standardization method for carbon analysis in coal using laser-induced breakdown spectroscopy

Measurement of carbon content in coal using laser-induced breakdown spectroscopy (LIBS) is limited by its low measurement precision and accuracy. A spectrum standardization method was proposed to achieve both reproducible and accurate results for the quantitative analysis of carbon content in coal with LIBS. The proposed method utilized the molecular carbon emissions to compensate for the diminution of atomic carbon emission caused by the matrix effect. The compensated carbon line intensities were further converted into an assumed standard state with fixed plasma temperature, electron density, and total number density of elemental carbon, which is proportional to its concentration in the coal samples. In addition, to obtain better compensation for total carbon number density fluctuations, an iterative algorithm was applied, which is different from our previous standardization calculations. The modified spectrum standardization model was applied to the measurement of carbon content in 24 bituminous coal samples. The results demonstrated that the proposed method had superior performance over the generally applied normalization methods. The average relative standard deviation, the coefficient of determination, the root-mean-square error of prediction, and the average maximum relative error for the modified model were 3.44%, 0.83, 2.71%, and 12.61%, respectively, while the corresponding values for the normalization with segmental spectrum area were 6.00%, 0.75, 3.77%, and 15.40%, respectively, showing an overwhelming improvement.

preprint2013arXiv

A Factor Framework for Experimental Design for Performance Evaluation of Commercial Cloud Services

Given the diversity of commercial Cloud services, performance evaluations of candidate services would be crucial and beneficial for both service customers (e.g. cost-benefit analysis) and providers (e.g. direction of service improvement). Before an evaluation is implemented, the selection of suitable factors (also called parameters or variables) is a prerequisite for designing evaluation experiments. However, there seems to be a lack of systematic approaches to factor selection for Cloud services performance evaluation. In other words, evaluators have selected experimental factors randomly and intuitively in most of the existing evaluation studies. Based on our previous taxonomy and modeling work, this paper proposes a factor framework for experimental design for performance evaluation of commercial Cloud services. This framework encapsulates the state-of-the-practice of performance evaluation factors that people currently take into account in the Cloud Computing domain, and in turn can help facilitate designing new experiments for evaluating Cloud services.

preprint2013arXiv

Building an Expert System for Evaluation of Commercial Cloud Services

Commercial Cloud services have been increasingly supplied to customers in industry. To facilitate customers' decision making, such as cost-benefit analysis or Cloud provider selection, evaluation of those Cloud services is becoming more and more crucial. However, compared with evaluation of traditional computing systems, more challenges will inevitably appear when evaluating rapidly-changing and user-uncontrollable commercial Cloud services. This paper proposes an expert system for Cloud evaluation that addresses emerging evaluation challenges in the context of Cloud Computing. Based on the knowledge and data accumulated by exploring the existing evaluation work, this expert system has been conceptually validated to be able to give suggestions and guidelines for implementing new evaluation experiments. As such, users can conveniently obtain evaluation experiences by using this expert system, which is essentially able to make existing efforts in Cloud services evaluation reusable and sustainable.

preprint2013arXiv

Circumstantial-Evidence-Based Judgment for Software Effort Estimation

Expert judgment for software effort estimation is oriented toward direct evidence that refers to the actual effort of similar projects or activities through experts' experiences. However, the availability of direct evidence implies the requirement of suitable experts together with past data. The circumstantial-evidence-based judgment proposed in this paper focuses on the development experiences deposited in human knowledge, and can then be used to qualitatively estimate the implementation effort of different proposals for a new project by rational inference. To demonstrate the process of circumstantial-evidence-based judgment, this paper adopts propositional learning theory based diagnostic reasoning to infer and compare different effort estimates when implementing a Web service composition project with different techniques and contexts. The exemplar shows that our proposed work can help determine effort tradeoffs before project implementation. Overall, circumstantial-evidence-based judgment is not an alternative but a complement to expert judgment, so as to facilitate and improve software effort estimation.

preprint2013arXiv

Early Observations on Performance of Google Compute Engine for Scientific Computing

Although Cloud computing emerged for business applications in industry, public Cloud services have been widely accepted and encouraged for scientific computing in academia. The recently available Google Compute Engine (GCE) is claimed to support high-performance and computationally intensive tasks, while few evaluation studies can be found to reveal GCE's scientific capabilities. Considering that fundamental performance benchmarking is the strategy of early-stage evaluation of new Cloud services, we followed the Cloud Evaluation Experiment Methodology (CEEM) to benchmark GCE and also compare it with Amazon EC2, to help understand the elementary capability of GCE for dealing with scientific problems. The experimental results and analyses show both potential advantages of, and possible threats to, applying GCE to scientific computing. For example, compared to Amazon's EC2 service, GCE may better suit applications that require frequent disk operations, while it may not be ready yet for single VM-based parallel computing. Following the same evaluation methodology, different evaluators can replicate and/or supplement this fundamental evaluation of GCE. Based on the fundamental evaluation results, suitable GCE environments can be further established for case studies of solving real science problems.

preprint2013arXiv

Effort-Oriented Classification Matrix of Web Service Composition

Within the service-oriented computing domain, Web service composition is an effective way to satisfy the rapidly changing requirements of business. Therefore, research into Web service composition has unfolded broadly. Since examining all of the related work in this area has become next to impossible, the classification of composition approaches can be used to facilitate multiple research tasks. However, the current attempts to classify Web service composition do not have clear objectives. Furthermore, the contexts and technologies of composition approaches are confused in the existing classifications. This paper proposes an effort-oriented classification matrix for Web service composition, which distinguishes between the context and technology dimensions. The context dimension is aimed at analyzing the influence of the environment on the effort of Web service composition, while the technology dimension focuses on the influence of technique on the effort. Consequently, besides the traditional classification benefits, this matrix can be used to build the basis of cost estimation for Web service composition in future research.

preprint2013arXiv

On a Catalogue of Metrics for Evaluating Commercial Cloud Services

Given the continually increasing number of commercial Cloud services in the market, evaluation of different services plays a significant role in cost-benefit analysis or decision making for choosing Cloud Computing. In particular, employing suitable metrics is essential in evaluation implementations. However, to the best of our knowledge, there is no systematic discussion about metrics for evaluating Cloud services. By using the method of Systematic Literature Review (SLR), we have collected the de facto metrics adopted in the existing Cloud services evaluation work. The collected metrics were arranged according to the different Cloud service features to be evaluated, which essentially constructed an evaluation metrics catalogue, as shown in this paper. This metrics catalogue can be used to facilitate future practice and research in the area of Cloud services evaluation. Moreover, considering that metrics selection is a prerequisite of benchmark selection in evaluation implementations, this work also supplements the existing research in benchmarking commercial Cloud services.

preprint2013arXiv

Software Cost Estimation Framework for Service-Oriented Architecture Systems using Divide-and-Conquer Approach

Due to the complexity of Service-Oriented Architecture (SOA), cost and effort estimation for SOA-based software development is more difficult than for traditional software development. Unfortunately, there is a lack of published work on cost and effort estimation for SOA-based software, and existing cost estimation approaches are inadequate for complex service-oriented systems. This paper proposes a novel framework based on Divide-and-Conquer (D&C) for estimating the cost of building SOA-based software. By dealing with the development parts separately, the D&C framework can help organizations simplify and regulate SOA implementation cost estimation. Furthermore, both cost estimation modeling and software sizing can be addressed by switching the corresponding metrics within this framework. Given the requirement of developing these metrics, this framework also defines future research in four different directions according to the separate cost estimation sub-problems.

preprint2013arXiv

The Cloud's Cloudy Moment: A Systematic Survey of Public Cloud Service Outage

Inadequate service availability is the top concern when employing Cloud computing. It has been recognized that zero downtime is impossible for large-scale Internet services. Nevertheless, by learning from their own previous mistakes and those of others, Cloud vendors can minimize the risk of future downtime or at least keep the downtime short. To facilitate summarizing lessons for Cloud providers, we performed a systematic survey of public Cloud service outage events. This paper reports the result of this survey. In addition to a set of findings, our work generated a lessons framework by classifying the outage root causes. The framework can in turn be used to arrange outage lessons for reference by Cloud providers. In our future work, this lessons framework can be smoothly expanded by including potentially new root causes.

preprint2013arXiv

Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services

Cloud Computing, as one of the most promising computing paradigms, has become increasingly accepted in industry. Numerous commercial providers have started to supply public Cloud services, and corresponding performance evaluation is then inevitably required for Cloud provider selection or cost-benefit analysis. Unfortunately, inaccurate and confusing evaluation implementations are often seen in the context of commercial Cloud Computing, which can severely interfere with and spoil evaluation-related comprehension and communication. This paper introduces a taxonomy to help profile and standardize the details of performance evaluation of commercial Cloud services. Through a systematic literature review, we constructed the taxonomy along two dimensions by arranging the atomic elements of Cloud-related performance evaluation. As such, the proposed taxonomy can be employed both to analyze existing evaluation practices by decomposing them into elements and to design new experiments by composing elements for evaluating the performance of commercial Cloud services. Moreover, through smooth expansion, we can continually adapt this taxonomy to the more general area of evaluation of Cloud Computing.

preprint2013arXiv

Towards Technology Independent Strategies for SOA Implementations

Benefiting from technology-based strategies, Service-Oriented Architecture (SOA) has been able to achieve general goals such as agility, flexibility, reusability and efficiency. Nevertheless, technical conditions alone cannot guarantee successful SOA implementations. As a valuable and necessary supplement, the space of technology-independent strategies should also be explored. By treating an SOA system as an instance of an organization and identifying the common ground between the similar processes of SOA implementation and organization design, this paper uses existing work in organization theory to inspire research into technology-independent strategies for SOA implementation. As a result, we identify four preliminary strategies, applicable in the organizational area, to support SOA implementations. Furthermore, a novel methodology for investigating technology-independent strategies for implementing SOA is revealed, which encourages interdisciplinary research across service-oriented computing and organization theory.

preprint2012arXiv

Correlated dynamics of the motion of proton-hole wave-packets in a photoionized water cluster

We explore the correlated dynamics of an electron-hole and a proton after ionization of a protonated water cluster by extreme ultra-violet (XUV) light. An ultrafast decay mechanism is found in which the proton-hole dynamics after the ionization are driven by electrostatic repulsion and involve a strong coupling between the nuclear and electronic degrees of freedom. We describe the system by a quantum-dynamical approach and show that non-adiabatic effects are a key element of the mechanism by which electron and proton repel each other and become localized at opposite sides of the cluster. Based on the generality of the decay mechanism, similar effects may be expected for other ionized systems featuring hydrogen bonds.

preprint2011arXiv

A Non-linearized PLS Model Based on Multivariate Dominant Factor for Laser-induced Breakdown Spectroscopy Measurements

A non-linearized PLS model based on a multivariate dominant factor is proposed. The intensities of different lines were taken to construct a multivariate dominant factor model, which describes the dominant concentration information of the measured species. In constructing such a multivariate model, a non-linear transformation of multiple characteristic line intensities, chosen according to the physical mechanisms of the laser-induced plasma spectrum, was combined with the linear-correlation-based PLS method to model the nonlinear self-absorption and inter-element interference effects. This enables the linear PLS method to describe non-linear relationships more accurately and provides the statistics-based PLS method with a physical background. Moreover, a secondary PLS is applied, utilizing the whole spectral information, to further correct the model results. Experiments were conducted using standard brass samples. Taylor expansion was applied to make the nonlinear transformation that describes the self-absorption effect of Cu. Then, line intensities of two other elements, Pb and Zn, were taken into account for inter-element interference. The proposed method shows a significant improvement when compared with the conventional PLS model. Results also show that, even compared with the already-improved baseline dominant-factor-based PLS model, the present PLS model based on the multivariate dominant factor yields the same calibration quality (R² = 0.999) while decreasing the RMSEP from 2.33% to 1.97%. The overall RMSE was also improved, from 1.27% to 1.05%.
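For readers comparing the figures quoted in this abstract, RMSEP is the root-mean-square error over the prediction samples, and the relative reductions follow directly from the reported values. The short sketch below only restates that arithmetic; the helper names `rmsep` and `relative_reduction` are illustrative and do not come from the paper, and the four percentages are the ones given in the abstract.

```python
import math


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction over paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Values reported in the abstract (both in percent).
rmsep_drop = relative_reduction(2.33, 1.97)
rmse_drop = relative_reduction(1.27, 1.05)
print(f"RMSEP reduction: {rmsep_drop:.1%}")
print(f"overall RMSE reduction: {rmse_drop:.1%}")
```

Read this way, the RMSEP improvement over the baseline dominant-factor model is a relative reduction of roughly 15%, and the overall RMSE improvement is roughly 17%.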

preprint2011arXiv

Bed-inventory Overturn Mechanism for Pant-leg Circulating Fluidized Bed Boilers

A numerical model was established to investigate the lateral mass transfer as well as the mechanism of bed-inventory overturn inside a pant-leg circulating fluidized bed (CFB), both of which are of great importance for maintaining safe and efficient operation of the CFB. Results show that the special flow structure, in which the solid particle volume fraction along the central line of the pant-leg CFB is relatively high, enlarges the lateral mass transfer rate and makes bed-inventory overturn more likely. Although the lateral pressure difference generated by lateral mass transfer inhibits continuing lateral mass transfer, providing the pant-leg CFB with some self-balancing ability, the change in primary flow rate caused by the outlet pressure change often disables this self-balancing ability by continually enhancing the flow rate difference. The more sensitive the flow rate of the primary air fan is to its outlet pressure, the more easily bed-inventory overturn occurs. Conversely, when the solid particles more readily change their flow pattern to follow the surrounding air flow, the self-balancing ability is more active.

preprint2011arXiv

Spectrum standardization for laser-induced breakdown spectroscopy measurements

This paper presents a spectrum normalization method for laser-induced breakdown spectroscopy (LIBS) measurements. To reduce the measurement uncertainty, it converts the characteristic line intensity recorded under varying conditions to the intensity under a standard condition with standard plasma temperature, degree of ionization, and total number density of the species of interest. For each laser pulse analysis, the characteristic line intensities of the species of interest are first converted to the intensity at a fixed temperature and standard degree of ionization but varying total number density. Under this state, if the influence of the variation of plasma morphology is neglected, the sum of multiple spectral line intensities for the measured element can be regarded as proportional to the total number density of that element, and this relationship is used to compensate for the fluctuation of the total number density, that is, the variation of ablation mass. In experiments with 29 brass alloy samples, applying this method to determine Cu concentration shows a significant improvement in measurement precision and accuracy over the generally applied normalization method. The average RSD value, average value of the error bar, R², RMSEP, and average value of the maximum relative error were 5.29%, 0.68%, 0.98, 2.72%, and 16.97%, respectively, while the corresponding values for normalization with the whole spectrum area were 8.61%, 1.37%, 0.95, 3.28%, and 29.19%, respectively.

preprint2010arXiv

A Novel Multivariate Model Based on Dominant Factor for Laser-induced Breakdown Spectroscopy Measurements

This paper presents a new approach that combines the partial least squares (PLS) method with a dominant factor based on physical principles. The characteristic line intensity of the specific element was taken to build up the dominant factor, which reflects the major elemental concentration, and the PLS approach was then applied to further improve the model accuracy. The deviation of the characteristic line intensity from the ideal condition was depicted and, based on this understanding of the deviation, the non-linear self-absorption and inter-element interference effects were modeled to improve the accuracy of the dominant factor model. With a dominant factor carrying the main quantitative information, the novel multivariate model combines the advantages of both the conventional univariate and PLS models and partially avoids the overuse of unrelated noise in the spectrum in the PLS application. The dominant factor makes the combined model more robust over a wide concentration range, and the PLS application improves the model accuracy for samples with matrices within the calibration sample set. Results show that the RMSEP of the final dominant-factor-based PLS model decreased to 2.33%, from 5.25% when using the conventional PLS approach with full spectral information. Furthermore, as the understanding of the physics of laser-induced plasma develops, there is potential to readily improve the accuracy of the dominant factor model as well as that of the proposed novel multivariate model.
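The combination described in this abstract can be read as a two-stage fit: a univariate dominant-factor model built on one characteristic line, followed by PLS on the remaining residual using the full spectrum. The sketch below illustrates that reading on synthetic data and is not the authors' implementation. The linear form of the dominant factor, the function names (`fit_dominant_factor`, `pls1_fit`, `pls1_predict`), the synthetic spectra and the choice of three PLS components are all assumptions; the PLS step is a plain NIPALS PLS1 written in NumPy.

```python
import numpy as np


def fit_dominant_factor(line_intensity, conc):
    """Least-squares line conc ~ a * intensity + b (assumed linear form)."""
    a, b = np.polyfit(line_intensity, conc, deg=1)
    return a, b


def pls1_fit(X, y, n_components):
    """NIPALS PLS1 for a single response; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # weight: covariance direction
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        qk = (yk @ t) / tt            # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - qk * t              # deflate y
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic "spectra" (assumption): 40 samples x 30 channels.
# Channel 0 plays the role of the characteristic line.
rng = np.random.default_rng(0)
conc = rng.uniform(50.0, 90.0, size=40)
spectra = rng.normal(0.0, 1.0, size=(40, 30))
spectra[:, 0] = 2.0 * conc + rng.normal(0.0, 3.0, size=40)
spectra[:, 1] = 0.05 * (conc - 70.0) ** 2 + rng.normal(0.0, 0.5, size=40)

# Stage 1: dominant factor carries the main quantitative information.
a, b = fit_dominant_factor(spectra[:, 0], conc)
dominant_pred = a * spectra[:, 0] + b

# Stage 2: PLS on the full spectrum corrects the remaining residual.
model = pls1_fit(spectra, conc - dominant_pred, n_components=3)
combined_pred = dominant_pred + pls1_predict(model, spectra)

rmse_dominant = rmse(conc, dominant_pred)
rmse_combined = rmse(conc, combined_pred)
print(f"dominant factor only:           {rmse_dominant:.3f}")
print(f"dominant factor + PLS residual: {rmse_combined:.3f}")
```

Both errors here are computed on the calibration samples, so the second can never exceed the first; a figure comparable to the RMSEP quoted in the abstract would require predicting held-out samples instead.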

80 published item(s)

A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

Beyond Known Fakes: Generalized Detection of AI-Generated Images via Post-hoc Distribution Alignment

Field-induced magnetic phase transitions and transport anomalies in GdAlSi

Pressure-Free Surface-Induced Flow by Geometric Rectification

VidLeaks: Membership Inference Attacks Against Text-to-Video Models

Anomalous Hall effect and rich magnetic phase diagram of Mn$_{100-x}$Rh$_{x}$ epitaxial films

HY-MT1.5 Technical Report

Kinetic Catalysis of Spontaneous Knotting: How Free Particles Modulate Filament Entanglement

Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review

Backdoor Attacks Against Dataset Distillation

DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Ultrafast X-ray Diffraction Probe of Coherent Spin-state Dynamics in Molecules

Accelerating Attention through Gradient-Based Learned Runtime Pruning

Anomalous thermal Hall effect and anomalous Nernst effect of CsV$_{3}$Sb$_{5}$

Auditing Membership Leakages of Multi-Exit Networks

Condensing Graphs via One-Step Gradient Matching

DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization

ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions

Indexing Metric Spaces for Exact Similarity Search

Just Enough, Just in Time, Just for "Me": Fundamental Principles for Engineering IoT-native Software Systems

Membership-Doctor: Comprehensive Assessment of Membership Inference Against Machine Learning Models

Multilingual Knowledge Graph Completion with Self-Supervised Adaptive Graph Alignment

Neutrino Rocket Jet Model: An Explanation of High-velocity Pulsars and their Spin-down Evolution

Online Knowledge Distillation for Efficient Pose Estimation

Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Control

Perspective: Ultrafast Imaging of Molecular Dynamics Using Ultrafast Low-Frequency Lasers, X-ray Free Electron Laser and Electron Pulses

RETE: Retrieval-Enhanced Temporal Event Forecasting on Unified Query Product Evolutionary Graph

Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

Towards Reproducible Evaluations for Flying Drone Controllers in Virtual Environments

Design and Control of a Highly Redundant Rigid-Flexible Coupling Robot to Assist the COVID-19 Oropharyngeal-Swab Sampling

Long Live The Image: Container-Native Data Persistence in Production

On a Factorial Knowledge Architecture for Data Science-powered Software Engineering

Reconstruction of Quantitative Susceptibility Maps from Phase of Susceptibility Weighted Imaging with Cross-Connected $Ψ$-Net

Selective quantum Zeno effect of ultracold atom-molecule scattering in dynamic magnetic fields

Stop Building Castles on a Swamp! The Crisis of Reproducing Automatic Search in Evidence-based Software Engineering

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Superresolving second-order correlation imaging using synthesized colored noise speckles

Tri-Hexagonal charge order in kagome metal CsV$_{3}$Sb$_{5}$ revealed by $^{121}$Sb NQR

A Dynamic Subspace Based BFGS Method for Large Scale Optimization Problem

COPOD: Copula-Based Outlier Detection

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Photoinduced Vibrations Drive Ultrafast Structural Distortion in Lead Halide Perovskite

Rate Splitting for Multi-Antenna Downlink: Precoder Design and Practical Implementation

Research on Annotation Rules and Recognition Algorithm Based on Phrase Window

Research on multi-dimensional end-to-end phrase recognition algorithm based on background knowledge

SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-Consensus

Double MRT Thermal Lattice Boltzmann Method for Simulating Natural Convection of Low Prandtl Number Fluids

Effects of Slotted Structures on Nonlinear Characteristics of Natural Convection in a Cylinder with an Internal Concentric Slotted Annulus

Interface induced high temperature superconductivity in single unit-cell FeSe films on SrTiO3(110)

Lattice Boltzmann Method Simulation of 3-D Melting Using Double MRT Model with Interfacial Tracking Method

Atomically Resolved FeSe/SrTiO3(001) Interface Structure by Scanning Transmission Electron Microscopy

Interface enhanced electron-phonon coupling and high temperature superconductivity in potassium-coated ultra-thin FeSe films on SrTiO3

Lattice Boltzmann Method Simulation of 3-D Natural Convection with Double MRT Model

Superconductivity dichotomy in K-coated single and double unit cell FeSe films on SrTiO3

The more Product Complexity, the more Actual Effort? An Empirical Investigation into Software Developments

Thermal and nonthermal melting of silicon under femtosecond x-ray irradiation

A model combining spectrum standardization and dominant factor based partial least square method for carbon analysis in coal by laser-induced breakdown spectroscopy

Dynamics of fluctuations in a quantum system

HERMES: A Hierarchical Broadcast-Based Silicon Photonic Interconnect for Scalable Many-Core Systems

Multiconfiguration time-dependent Hartree impurity solver for nonequilibrium dynamical mean-field theory

Noise Equivalent Counts Based Emission Image Reconstruction Algorithm of Tomographic Gamma Scanning

Studies of LL-type 500MHz 5-cell superconducting cavity at SINAP

The application of spectrum standardization method for carbon analysis in coal using laser-induced breakdown spectroscopy

A Factor Framework for Experimental Design for Performance Evaluation of Commercial Cloud Services

Building an Expert System for Evaluation of Commercial Cloud Services

Circumstantial-Evidence-Based Judgment for Software Effort Estimation

Early Observations on Performance of Google Compute Engine for Scientific Computing

Effort-Oriented Classification Matrix of Web Service Composition

On a Catalogue of Metrics for Evaluating Commercial Cloud Services

Software Cost Estimation Framework for Service-Oriented Architecture Systems using Divide-and-Conquer Approach

The Cloud's Cloudy Moment: A Systematic Survey of Public Cloud Service Outage

Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services