Source author record

Xiaolin Li

Xiaolin Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning, Artificial Intelligence, Cryptography and Security, Computation and Language, Computer Vision, cond-mat.mtrl-sci, Biological Physics, cond-mat.mes-hall, Genomics, Human-Computer Interaction, Information Theory, math.IT, Methodology, Multimedia, Neural and Evolutionary Computing, physics.atom-ph, Tissues and Organs

Catalog footprint

What is connected

20 works

17 topics

4 close collaborators

preprint | 2022 | arXiv

ES Attack: Model Stealing against Deep Neural Networks without Data Hurdles

Deep neural networks (DNNs) have become essential components of various commercialized machine learning services, such as Machine Learning as a Service (MLaaS). Recent studies show that machine learning services face severe privacy threats: well-trained DNNs owned by MLaaS providers can be stolen through public APIs, in so-called model stealing attacks. However, most existing works undervalue the impact of such attacks because they assume that a successful attack has to acquire confidential training data or auxiliary data regarding the victim DNN. In this paper, we propose ES Attack, a novel model stealing attack without any data hurdles. By using heuristically generated synthetic data, ES Attack iteratively trains a substitute model and eventually achieves a functionally equivalent copy of the victim DNN. The experimental results reveal the severity of ES Attack: i) ES Attack successfully steals the victim model without data hurdles, and even outperforms most existing model stealing attacks that use auxiliary data in terms of model accuracy; ii) most countermeasures are ineffective in defending against ES Attack; iii) ES Attack facilitates further attacks relying on the stolen model.

preprint | 2022 | arXiv

Group-wise Reinforcement Feature Generation for Optimal and Explainable Representation Space Reconstruction

Representation (feature) space is an environment where data points are vectorized, distances are computed, patterns are characterized, and geometric structures are embedded. Extracting a good representation space is critical to address the curse of dimensionality, improve model generalization, overcome data sparsity, and increase the availability of classic models. Existing literature, such as feature engineering and representation learning, is limited in achieving full automation (e.g., heavy reliance on intensive labor and empirical experience), explainable explicitness (e.g., a traceable reconstruction process and explainable new features), and flexible optimality (e.g., optimal feature space reconstruction is not embedded into downstream tasks). Can we simultaneously address the automation, explicitness, and optimality challenges in representation space reconstruction for a machine learning task? To answer this question, we propose a group-wise reinforcement generation perspective. We reformulate representation space reconstruction as an interactive process of nested feature generation and selection, where feature generation produces new meaningful and explicit features, and feature selection eliminates redundant features to control feature sizes. We develop a cascading reinforcement learning method that leverages three cascading Markov Decision Processes to learn optimal generation policies that automate the selection of features and operations and the feature crossing. We design a group-wise generation strategy that crosses a feature group, an operation, and another feature group to generate new features, and find that the strategy can enhance exploration efficiency and augment the reward signals of the cascading agents. Finally, we present extensive experiments to demonstrate the effectiveness, efficiency, traceability, and explicitness of our system.

preprint | 2022 | arXiv

Semi-supervised Drifted Stream Learning with Short Lookback

In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data are not always i.i.d. and drift gradually over time; 4) the storage of historical streams is limited and model updating can only be achieved based on a very short lookback window. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such a setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning, continuous learning, and domain adaptation: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. The framework is able to accomplish: 1) robust pseudo-labeling in the generation step; 2) anti-forgetting adaptation in the replay step. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model to leverage supervised knowledge of previously labeled data, unsupervised knowledge of new data, and structural knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat region search problem. We propose a novel minimax game-based replay objective function to solve the flat region search problem and develop an effective optimization solver. Finally, we present extensive experiments to demonstrate that our framework can effectively address the task of anti-forgetting learning in drifted streams with short lookback.

preprint | 2021 | arXiv

Modeling and Computation of High Efficiency and Efficacy Multi-Step Batch Testing for Infectious Diseases

We propose a mathematical model based on probability theory to optimize COVID-19 testing by a multi-step batch testing approach with variable batch sizes. This model and simulation tool dramatically increase the efficiency and efficacy of the tests in a large population at a low cost, particularly when the infection rate is low. The proposed method combines statistical modeling with numerical methods to solve nonlinear equations and obtain optimal batch sizes at each step of tests, with the flexibility to incorporate geographic and demographic information. In theory, this method substantially improves the false positive rate and positive predictive value as well. We also conducted a Monte Carlo simulation to verify this theory. Our simulation results show that our method significantly reduces the false negative rate. More accurate assessment can be made if the dilution effect or other practical factors are taken into consideration. The proposed method will be particularly useful for the early detection of infectious diseases and prevention of future pandemics. The proposed work will have broader impacts on medical testing for contagious diseases in general.
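To make the efficiency argument concrete, here is a minimal sketch of classic two-stage (Dorfman) pooled testing. It is a simplified stand-in and not the multi-step, variable-batch-size model of the abstract: it assumes a perfect test, no dilution effect, and independent infections at rate p. The function names are hypothetical.

```python
# Classic two-stage (Dorfman) pooled testing, used here only as a simplified
# stand-in for the multi-step scheme described in the abstract.
# Assumptions: perfect test, no dilution effect, independent infections.


def expected_tests_per_person(p: float, k: int) -> float:
    """Expected number of tests per person for pool size k at infection rate p."""
    if k == 1:
        return 1.0  # individual testing, no pooling
    # One pooled test shared by k people, plus k individual retests
    # whenever the pool is positive (probability 1 - (1 - p)**k).
    return 1.0 / k + 1.0 - (1.0 - p) ** k


def best_pool_size(p: float, max_k: int = 100) -> int:
    """Pool size in 1..max_k that minimises the expected tests per person."""
    return min(range(1, max_k + 1), key=lambda k: expected_tests_per_person(p, k))


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05):
        k = best_pool_size(p)
        print(f"p={p}: pool size {k}, "
              f"{expected_tests_per_person(p, k):.4f} tests per person")
```

Under these assumptions the saving grows as the infection rate falls, which matches the abstract's remark that batch testing is most useful when the infection rate is low.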

preprint | 2020 | arXiv

A Batch Normalized Inference Network Keeps the KL Vanishing Away

Variational Autoencoder (VAE) is widely used as a generative model to approximate a model's posterior on latent variables by combining amortized variational inference and deep neural networks. However, when paired with strong autoregressive decoders, VAE often converges to a degenerate local optimum known as "posterior collapse". Previous approaches consider the Kullback-Leibler divergence (KL) individually for each datapoint. We propose to let the KL follow a distribution across the whole dataset, and show that keeping the expectation of the KL's distribution positive is sufficient to prevent posterior collapse. Then we propose Batch Normalized-VAE (BN-VAE), a simple but effective approach to set a lower bound on the expectation by regularizing the distribution of the approximate posterior's parameters. Without introducing any new model component or modifying the objective, our approach can avoid posterior collapse effectively and efficiently. We further show that the proposed BN-VAE can be extended to conditional VAE (CVAE). Empirically, our approach surpasses strong autoregressive baselines on language modeling, text classification and dialogue generation, and rivals more complex approaches while keeping almost the same training time as VAE.
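The lower bound mentioned above can be sketched as follows. The setup is an assumption, not stated in the abstract: a diagonal Gaussian approximate posterior with mean \mu(x) and variance \sigma^2(x) over n latent dimensions, a standard normal prior, and batch normalization applied to \mu so that each dimension has mean \beta and standard deviation \gamma across the dataset.

```latex
\begin{align}
\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
  &= \frac{1}{2}\sum_{i=1}^{n}\left(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\right), \\
\mathbb{E}_{x}\big[\mathrm{KL}\big]
  &\ge \frac{1}{2}\sum_{i=1}^{n}\mathbb{E}_{x}\big[\mu_i^2\big]
   = \frac{1}{2}\sum_{i=1}^{n}\left(\operatorname{Var}_{x}[\mu_i] + \mathbb{E}_{x}[\mu_i]^2\right)
   = \frac{n\left(\gamma^2 + \beta^2\right)}{2}.
\end{align}
```

The inequality uses s - \log s - 1 \ge 0 for every s > 0, so the variance terms can only add to the KL. Under the stated assumptions, fixing \gamma > 0 keeps the expected KL strictly positive.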

preprint | 2020 | arXiv

A Praise for Defensive Programming: Leveraging Uncertainty for Effective Malware Mitigation

A promising avenue for improving the effectiveness of behavioral-based malware detectors would be to combine fast traditional machine learning detectors with high-accuracy, but time-consuming deep learning models. The main idea would be to place software receiving borderline classifications by traditional machine learning methods in an environment where uncertainty is added, while the software is analyzed by more time-consuming deep learning models. The goal of uncertainty would be to rate-limit actions of potential malware during the time-consuming deep analysis. In this paper, we present a detailed description of the analysis and implementation of CHAMELEON, a framework for realizing this uncertain environment for Linux. CHAMELEON offers two environments for software: (i) standard - for any software identified as benign by conventional machine learning methods and (ii) uncertain - for software receiving borderline classifications when analyzed by these conventional machine learning methods. The uncertain environment adds obstacles to software execution through random perturbations applied probabilistically on selected system calls. We evaluated CHAMELEON with 113 applications and 100 malware samples for Linux. Our results showed that at a threshold of 10%, intrusive and non-intrusive strategies caused approximately 65% of malware to fail to accomplish their tasks, while approximately 30% of the analyzed benign software met with various levels of disruption. With a dynamic, per-system-call threshold, CHAMELEON caused 92% of the malware to fail, and only 10% of the benign software to be disrupted. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software. Further, we analyzed the logs of software that crashed under non-intrusive strategies, and found that some crashes are due to software bugs.
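The idea of perturbing selected calls with a per-call probability can be illustrated with a toy user-space model. This is only a sketch: the class name, the threshold table, and the choice of raising OSError as the perturbation are all hypothetical, and the real framework operates on Linux system calls rather than Python functions.

```python
import random
from typing import Any, Callable, Dict, Optional


class UncertainEnvironment:
    """Toy model: each monitored call fails with its own probability.

    Hypothetical illustration only; not the CHAMELEON implementation.
    """

    def __init__(self, thresholds: Dict[str, float],
                 seed: Optional[int] = None) -> None:
        self.thresholds = thresholds   # call name -> perturbation probability
        self.rng = random.Random(seed)
        self.perturbed = 0             # number of calls perturbed so far

    def call(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Run fn(*args), or perturb it with the probability set for `name`."""
        p = self.thresholds.get(name, 0.0)  # unmonitored calls are never perturbed
        if self.rng.random() < p:
            self.perturbed += 1
            raise OSError(f"{name}: call failed by the uncertain environment")
        return fn(*args)


if __name__ == "__main__":
    env = UncertainEnvironment({"write": 0.10, "connect": 0.50}, seed=42)
    failures = 0
    for _ in range(1000):
        try:
            env.call("write", len, b"payload")
        except OSError:
            failures += 1
    print(f"{failures} of 1000 write calls were perturbed")
```

A per-call table like this corresponds loosely to the dynamic, per-system-call threshold in the abstract; a single shared value corresponds to the fixed 10% setting.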

preprint | 2020 | arXiv

Arbitrary-sized Image Training and Residual Kernel Learning: Towards Image Fraud Identification

Preserving original noise residuals in images is critical to image fraud identification. Since the resizing operation during deep learning damages the microstructures of image noise residuals, we propose a framework for training directly on images at their original input scales without resizing. Our arbitrary-sized image training method mainly depends on pseudo-batch gradient descent (PBGD), which bridges the gap between the input batch and the update batch to ensure that model updates run normally for arbitrary-sized images. In addition, a 3-phase alternate training strategy is designed to learn optimal residual kernels for image fraud identification. With the learnt residual kernels and PBGD, the proposed framework achieves state-of-the-art results in image fraud identification, especially for images with small tampered regions or unseen images with different tampering distributions.
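One plausible reading of separating the input batch from the update batch is gradient accumulation: samples are processed one at a time, and the parameters are updated only after a fixed number of per-sample gradients have been averaged. The sketch below shows that reading on a one-parameter toy model. It is an assumption about the mechanism, not the paper's PBGD implementation, and all names are hypothetical.

```python
from typing import Iterable, Tuple


def grad_single(w: float, x: float, y: float) -> float:
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w for one sample."""
    return (w * x - y) * x


def train_accumulated(samples: Iterable[Tuple[float, float]], w: float = 0.0,
                      lr: float = 0.1, update_batch: int = 4,
                      epochs: int = 50) -> float:
    """Process one sample at a time; update w once per `update_batch` samples."""
    samples = list(samples)
    for _ in range(epochs):
        acc, count = 0.0, 0
        for x, y in samples:
            acc += grad_single(w, x, y)   # input batch of size one
            count += 1
            if count == update_batch:     # update batch is complete
                w -= lr * acc / count
                acc, count = 0.0, 0
        if count:                         # flush a final partial update batch
            w -= lr * acc / count
    return w


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (0.5, 1.0)]  # y = 2 * x
    print(train_accumulated(data))
```

Because each sample is handled alone, inputs never need to share a shape, which is what would let images of different sizes contribute to the same update.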

preprint | 2020 | arXiv

Asking Complex Questions with Multi-hop Answer-focused Reasoning

Asking questions from natural language text has attracted increasing attention recently, and several schemes have been proposed with promising results by asking the right question words and copying relevant words from the input to the question. However, most state-of-the-art methods focus on asking simple questions involving single-hop relations. In this paper, we propose a new task called multi-hop question generation that asks complex and semantically relevant questions by additionally discovering and modeling the multiple entities and their semantic relations given a collection of documents and the corresponding answer. To solve the problem, we propose multi-hop answer-focused reasoning on the grounded answer-centric entity graph to include different granularity levels of semantic information, including the word-level and document-level semantics of the entities and their semantic relations. Through extensive experiments on the HOTPOTQA dataset, we demonstrate the superiority and effectiveness of our proposed model, which serves as a baseline to motivate future work.

preprint | 2020 | arXiv

Connecting Web Event Forecasting with Anomaly Detection: A Case Study on Enterprise Web Applications Using Self-Supervised Neural Networks

Recently, web applications have been widely used in enterprises to assist employees in providing effective and efficient business processes. Forecasting upcoming web events in enterprise web applications can be beneficial in many ways, such as efficient caching and recommendation. In this paper, we present a web event forecasting approach, DeepEvent, in enterprise web applications for better anomaly detection. DeepEvent includes three key features: web-specific neural networks to take into account the characteristics of sequential web events, self-supervised learning techniques to overcome the scarcity of labeled data, and sequence embedding techniques to integrate contextual events and capture dependencies among web events. We evaluate DeepEvent on web events collected from six real-world enterprise web applications. Our experimental results demonstrate that DeepEvent is effective in forecasting sequential web events and detecting web-based anomalies. DeepEvent provides a context-based system for researchers and practitioners to better forecast web events with situational awareness.

preprint | 2020 | arXiv

Improving Question Generation with Sentence-level Semantic Matching and Answer Position Inferring

Taking an answer and its context as input, sequence-to-sequence models have made considerable progress on question generation. However, we observe that these approaches often generate wrong question words or keywords and copy answer-irrelevant words from the input. We believe that the lack of global question semantics and the poor exploitation of answer position-awareness are the key root causes. In this paper, we propose a neural question generation model with two concrete modules: sentence-level semantic matching and answer position inferring. Further, we enhance the initial state of the decoder by leveraging the answer-aware gated fusion mechanism. Experimental results demonstrate that our model outperforms the state-of-the-art (SOTA) models on the SQuAD and MARCO datasets. Owing to its generality, our work also improves the existing models significantly.

preprint | 2020 | arXiv

Knowledge Federation: A Unified and Hierarchical Privacy-Preserving AI Framework

With strict protections and regulations of data privacy and security, conventional machine learning based on centralized datasets is confronted with significant challenges, making artificial intelligence (AI) impractical in many mission-critical and data-sensitive scenarios, such as finance, government, and health. In the meantime, tremendous datasets are scattered in isolated silos in various industries, organizations, different units of an organization, or different branches of an international organization. These valuable data resources are well underused. To advance AI theories and applications, we propose a comprehensive framework (called Knowledge Federation - KF) to address these challenges by enabling AI while preserving data privacy and ownership. Beyond the concepts of federated learning and secure multi-party computation, KF consists of four levels of federation: (1) information level, low-level statistics and computation of data, meeting the requirements of simple queries, searching and simplistic operators; (2) model level, supporting training, learning, and inference; (3) cognition level, enabling abstract feature representation at various levels of abstraction and context; (4) knowledge level, fusing knowledge discovery, representation, and reasoning. We further clarify the relationship and differentiation between knowledge federation and other related research areas. We have developed a reference implementation of KF, called iBond Platform, to offer a production-quality KF platform to enable industrial applications in finance, insurance, and other sectors. The iBond platform will also help establish the KF community and a comprehensive ecosystem and usher in a novel paradigm shift towards secure, privacy-preserving and responsible AI. As far as we know, knowledge federation is the first hierarchical and unified framework for secure multi-party computing and learning.

preprint | 2020 | arXiv

PRI-VAE: Principle-of-Relevant-Information Variational Autoencoders

Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties of the learning dynamics of most VAE models remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective to analyze existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark data sets.

preprint | 2018 | arXiv

Comparing Clinical Judgment with MySurgeryRisk Algorithm for Preoperative Risk Assessment: A Pilot Study

Background: Major postoperative complications are associated with increased short- and long-term mortality, increased healthcare cost, and adverse long-term consequences. The large amount of data contained in the electronic health record (EHR) creates barriers for physicians to recognize patients most at risk. We hypothesize that, if presented in an optimal format, information from data-driven predictive risk algorithms for postoperative complications can improve physician risk assessment. Methods: Prospective, non-randomized, interventional pilot study of twenty perioperative physicians at a quaternary academic medical center. Using 150 clinical cases, we compared physicians' risk assessment before and after interaction with MySurgeryRisk, a validated machine-learning algorithm predicting preoperative risk for six major postoperative complications using EHR data. Results: The area under the curve (AUC) of the MySurgeryRisk algorithm ranged between 0.73 and 0.85 and was significantly higher than physicians' risk assessments (AUC between 0.47 and 0.69) for all postoperative complications except cardiovascular complications. The AUC for physicians' repeated risk assessment improved by 2% to 5% for all complications with the exception of thirty-day mortality. Physicians' risk assessment for acute kidney injury and intensive care unit admission longer than 48 hours significantly improved after knowledge exchange, resulting in net reclassification improvement of 12.4% and 16%, respectively. Conclusions: The validated MySurgeryRisk algorithm predicted postoperative complications with equal or higher accuracy than a pilot cohort of physicians using available clinical preoperative data. The interaction with the algorithm significantly improved physicians' risk assessment.

preprint | 2016 | arXiv

DeepCancer: Detecting Cancer through Gene Expressions via Deep Generative Learning

Transcriptional profiling on microarrays to obtain gene expressions has been used to facilitate cancer diagnosis. We propose a deep generative machine learning architecture (called DeepCancer) that learns features from unlabeled microarray data. These models have been used in conjunction with conventional classifiers that classify the tissue samples as either cancerous or non-cancerous. The proposed model has been tested on two different clinical datasets. The evaluation demonstrates that the DeepCancer model achieves a very high precision score, while significantly controlling the false positive and false negative scores.

preprint | 2015 | arXiv

Identifying the Absorption Bump with Deep Learning

The pervasive interstellar dust grains provide significant insights into the formation and evolution of stars, planetary systems, and galaxies, and may harbor the building blocks of life. One of the most effective ways to analyze the dust is via its interaction with the light from background sources. The observed extinction curves and spectral features carry information on the size and composition of the dust. The broad absorption bump at 2175 Angstrom is the most prominent feature in the extinction curves. Traditionally, statistical methods are applied to detect the existence of the absorption bump. These methods require heavy preprocessing and the co-existence of other reference features to alleviate the influence of noise. In this paper, we apply Deep Learning techniques to detect the broad absorption bump. We demonstrate the key steps for training the selected models and their results. The success of the Deep Learning based method inspires us to generalize a common methodology for broader science discovery problems. We present our ongoing work to build the DeepDis system for this kind of application.

preprint | 2011 | arXiv

Secure Multiplex Coding Over Interference Channel with Confidential Messages

In this paper, inner and outer bounds on the capacity region of two-user interference channels with two confidential messages are proposed. By adding secure multiplex coding to the error correction method in [15], which achieves the best achievable capacity region for the interference channel to date, we show that the improved secure capacity region compared with [2] is now the whole Han-Kobayashi region. In addition, this construction not only removes the rate loss incurred by adding dummy messages to achieve security, but also changes the original weak security condition in [2] to strong security. The equivocation rate for a collection of secret messages is also evaluated: when the length of the message is finite or the information rate is high, our result provides a good approximation for bounding the worst-case equivocation rate. Our results can be readily extended to the Gaussian interference channel with little effort.

preprint | 2010 | arXiv

Edge Magneto-Fingerprints in Disordered Graphene Nanoribbons

We report on (magneto)-transport experiments in chemically derived narrow graphene nanoribbons under high magnetic fields (up to 60 Tesla). Evidence of field-dependent electronic confinement features is given, and allows estimating the possible ribbon edge symmetry. In addition, the measured large positive magnetoconductance indicates a strong suppression of backscattering induced by the magnetic field. This scenario is supported by quantum simulations that consider different types of underlying disorder (smooth edge disorder and long-range Coulomb scatterers).

preprint | 2010 | arXiv

Multiplexed five-color molecular imaging of cancer cells and tumor tissues with carbon nanotube Raman tags in the near-infrared

Single-walled carbon nanotubes (SWNTs) with five different C13/C12 isotope compositions and well-separated Raman peaks have been synthesized and conjugated to five targeting ligands in order to impart molecular specificity. Multiplexed Raman imaging of live cells has been carried out by highly specific staining of cells with a five-color mixture of SWNTs. Ex vivo multiplexed Raman imaging of tumor samples uncovers a surprising up-regulation of epidermal growth factor receptor (EGFR) on LS174T colon cancer cells from cell culture to in vivo tumor growth. This is the first time five-color multiplexed molecular imaging has been performed in the near-infrared (NIR) region under a single laser excitation. Near zero interfering background of imaging is achieved due to the sharp Raman peaks unique to nanotubes over the low, smooth autofluorescence background of biological species.

preprint | 2008 | arXiv

Bose-Einstein condensation on an atom chip

We report an experiment creating a Bose-Einstein condensate (BEC) on an atom chip. The chip-based Z-wire current and a homogeneous bias magnetic field create a tight magnetic trap, which allows for fast production of BEC. After 4.17 s of forced radio-frequency evaporative cooling, a condensate with about 3000 atoms appears. The transition temperature is about 300 nK. This compact system is quite robust, allowing for versatile extensions and further study of BEC.

preprint | 2008 | arXiv

Highly Conducting Graphene Sheets and Langmuir-Blodgett Films

Graphene is an intriguing material with properties that are distinct from those of other graphitic systems. The first samples of pristine graphene were obtained by peeling off and epitaxial growth. Recently, the chemical reduction of graphite oxide was used to produce covalently functionalized single-layer graphene oxide. However, chemical approaches for the large-scale production of highly conducting graphene sheets remain elusive. Here, we report that the exfoliation-reintercalation-expansion of graphite can produce high-quality single-layer graphene sheets stably suspended in organic solvents. The graphene sheets exhibit high electrical conductance at room and cryogenic temperatures. Large amounts of graphene sheets in organic solvents are made into large transparent conducting films by Langmuir-Blodgett assembly in a layer-by-layer manner. The chemically derived high quality graphene sheets could lead to future scalable graphene devices.
