Source author record

Hai Li

Hai Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Catalog footprint

What is connected

39 works

21 topics

4 close collaborators

preprint 2026 arXiv

Evaluating Accounting Reasoning Capabilities of Large Language Models

Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.

preprint 2023 arXiv

HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble

Ensemble learning has gained attention in recent deep learning research as a way to further boost the accuracy and generalizability of deep neural network (DNN) models. Recent ensemble training methods explore different training algorithms or settings on multiple sub-models with the same model architecture, which leads to a significant burden on the memory and computation cost of the ensemble model. Meanwhile, the heuristically induced diversity may not lead to a significant performance gain. We propose a new perspective on exploring the intrinsic diversity within a model architecture to build efficient DNN ensembles. We make an intriguing observation that pruning and quantization, while both leading to an efficient model architecture at the cost of a small accuracy drop, lead to distinct behavior in the decision boundary. To this end, we propose Heterogeneously Compressed Ensemble (HCE), where we build an efficient ensemble with the pruned and quantized variants from a pretrained DNN model. A diversity-aware training objective is proposed to further boost the performance of the HCE ensemble. Experimental results show that HCE achieves significant improvement in the efficiency-accuracy tradeoff compared to both traditional DNN ensemble training methods and previous model compression methods.

preprint 2022 arXiv

FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning

Client-wise data heterogeneity is one of the major issues that hinder effective training in federated learning (FL). Since the data distribution on each client may vary dramatically, the client selection strategy can significantly influence the convergence rate of the FL process. Active client selection strategies have been widely proposed in recent studies. However, they neglect the loss correlations between the clients and achieve only marginal improvement compared to the uniform selection strategy. In this work, we propose FedCor -- an FL framework built on a correlation-based client selection strategy, to boost the convergence rate of FL. Specifically, we first model the loss correlations between the clients with a Gaussian Process (GP). Based on the GP model, we derive a client selection strategy with a significant reduction of expected global loss in each round. Besides, we develop an efficient GP training method with a low communication overhead in the FL scenario by utilizing the covariance stationarity. Our experimental results show that compared to the state-of-the-art method, FedCor can improve the convergence rates by $34\%\sim 99\%$ and $26\%\sim 51\%$ on FMNIST and CIFAR-10, respectively.

preprint 2022 arXiv

Field free switching through bulk spin-orbit torque in L10-FePt films deposited on vicinal substrates

L10-FePt distinguishes itself by its ultrahigh perpendicular magnetic anisotropy (PMA), which enables memory cells with sufficient thermal stability to scale down to 3 nm. The recently discovered "bulk" spin-orbit torques in L10-FePt provide an efficient and scalable way to manipulate the L10-FePt magnetization. However, the need for an external field during switching limits its practical application, and therefore field-free switching of L10-FePt is in high demand. In this manuscript, we demonstrate field-free switching of L10-FePt by growing it on vicinal MgO (001) substrates. This method differs from previously established strategies, as it does not need additional functional layers or asymmetry in the film structure. We demonstrate that the field-free switching is robust and can withstand strong field disturbances up to ~1 kOe. The dependence on vicinal angle, film thickness, and growth temperature demonstrates a wide operation window for the field-free switching of L10-FePt. We confirm that the physical origin of the field-free switching is the tilted anisotropy of L10-FePt induced by the vicinal surface. We quantitatively characterize the spin-orbit torques in the L10-FePt films and find that they are not significantly influenced by the lattice strain from the vicinal substrates. Our results extend beyond the established strategies to realize field-free switching, and could potentially be applied to other magnetic and antiferromagnetic systems.

preprint 2022 arXiv

Fine-grain Inference on Out-of-Distribution Data with Hierarchical Classification

Machine learning methods must be trusted to make appropriate decisions in real-world environments, even when faced with out-of-distribution (OOD) samples. Many current approaches simply aim to detect OOD examples and alert the user when an unrecognized input is given. However, when the OOD sample significantly overlaps with the training data, a binary anomaly detection is not interpretable or explainable, and provides little information to the user. We propose a new model for OOD detection that makes predictions at varying levels of granularity: as the inputs become more ambiguous, the model predictions become coarser and more conservative. Consider an animal classifier that encounters an unknown bird species and a car. Both cases are OOD, but the user gains more information if the classifier recognizes that its uncertainty over the particular species is too large and predicts bird instead of detecting it as OOD. Furthermore, we diagnose the classifier's performance at each level of the hierarchy, improving the explainability and interpretability of the model's predictions. We demonstrate the effectiveness of hierarchical classifiers for both fine- and coarse-grained OOD tasks.
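
The coarsening behavior described in this abstract can be illustrated with a small sketch. It is not the authors' model: the toy hierarchy `PARENT`, the confidence `threshold` of 0.7, and the rule of backing off to the parent node until the summed probability is high enough are all assumptions made for illustration. Reaching the root (`"animal"`) is treated here as "no informative prediction", i.e. a plain OOD flag.

```python
# Hypothetical two-level hierarchy: leaf -> parent -> root.
PARENT = {
    "sparrow": "bird", "eagle": "bird",
    "cat": "mammal", "dog": "mammal",
    "bird": "animal", "mammal": "animal",
}

def node_probabilities(leaf_probs):
    """Propagate leaf probabilities upward so each inner node holds the sum of its leaves."""
    probs = dict(leaf_probs)
    for leaf, p in leaf_probs.items():
        node = leaf
        while node in PARENT:
            node = PARENT[node]
            probs[node] = probs.get(node, 0.0) + p
    return probs

def predict(leaf_probs, threshold=0.7):
    """Start at the most likely leaf and back off to coarser nodes until confident."""
    probs = node_probabilities(leaf_probs)
    node = max(leaf_probs, key=leaf_probs.get)
    while probs[node] < threshold and node in PARENT:
        node = PARENT[node]
    return node

known_bird = {"sparrow": 0.90, "eagle": 0.05, "cat": 0.03, "dog": 0.02}
unknown_bird = {"sparrow": 0.45, "eagle": 0.40, "cat": 0.10, "dog": 0.05}
car = {"sparrow": 0.25, "eagle": 0.25, "cat": 0.25, "dog": 0.25}

print(predict(known_bird))    # fine-grained prediction
print(predict(unknown_bird))  # coarser prediction
print(predict(car))           # falls back to the root
```

With these made-up probabilities, the unknown bird species is still reported as a bird, while the car falls through to the root and would be reported as OOD.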

preprint 2022 arXiv

IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion

Prosody modeling is important, but still challenging, in expressive voice conversion. Prosody is difficult to model, and other factors that are entangled with prosody in speech, e.g., speaker, environment and content, should be removed in prosody modeling. In this paper, we present IQDubbing to solve this problem for expressive voice conversion. To model prosody, we leverage the recent advances in discrete self-supervised speech representation (DSSR). Specifically, a prosody vector is first extracted from a pre-trained VQ-Wav2Vec model, where rich prosody information is embedded while most speaker and environment information is removed effectively by quantization. To further filter out the redundant information except prosody, such as content and partial speaker information, we propose two kinds of prosody filters to sample prosody from the prosody vector. Experiments show that IQDubbing is superior to baseline and comparison systems in terms of speech quality while maintaining prosody consistency and speaker similarity.

preprint 2022 arXiv

Joint Optimization of STAR-RIS Assisted UAV Communication Systems

In this letter, we study the simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted unmanned aerial vehicle (UAV) communications. Our goal is to maximize the sum rate of all users by jointly optimizing the STAR-RIS's beamforming vectors, the UAV's trajectory and power allocation. We decompose the formulated non-convex problem into three subproblems and solve them alternately to obtain the solution. Simulations show that: 1) the STAR-RIS achieves a higher sum rate than traditional RIS; 2) to exploit the benefits of STAR-RIS, the UAV's trajectory is closer to STAR-RIS than that of RIS; 3) the energy splitting for reflection and transmission highly depends on the real-time trajectory of UAV.

preprint 2022 arXiv

Learning and Compositionality: a Unification Attempt via Connectionist Probabilistic Programming

We consider learning and compositionality as the key mechanisms towards simulating human-like intelligence. While each mechanism is successfully achieved by neural networks and symbolic AIs, respectively, it is the combination of the two mechanisms that makes human-like intelligence possible. Despite the numerous attempts at building hybrid neural-symbolic systems, we argue that our true goal should be unifying learning and compositionality, the core mechanisms, instead of neural and symbolic methods, the surface approaches to achieve them. In this work, we review and analyze the strengths and weaknesses of neural and symbolic methods by separating their forms and meanings (structures and semantics), and propose Connectionist Probabilistic Programs (CPPs), a framework that connects connectionist structures (for learning) and probabilistic program semantics (for compositionality). Under the framework, we design a CPP extension for small scale sequence modeling and provide a learning algorithm based on Bayesian inference. Although challenges exist in learning complex patterns without supervision, our early results demonstrate CPP's successful extraction of concepts and relations from raw sequential data, an initial step towards compositional learning.

preprint 2022 arXiv

Nonvolatile Electric Field Control of Thermal Magnons in the Absence of an Applied Magnetic Field

Spin transport through magnetic insulators has been demonstrated in a variety of materials and is an emerging pathway for next-generation spin-based computing. To modulate spin transport in these systems, one typically applies a sufficiently strong magnetic field to allow for deterministic control of magnetic order. Here, we make use of the well-known multiferroic magnetoelectric, BiFeO3, to demonstrate non-volatile, hysteretic, electric-field control of thermally excited magnon current in the absence of an applied magnetic field. These findings are an important step toward magnon-based devices, where electric-field-only control is highly desirable.

preprint 2022 arXiv

ReaLPrune: ReRAM Crossbar-aware Lottery Ticket Pruned CNNs

Training machine learning (ML) models at the edge (on-chip training on end user devices) can address many pressing challenges including data privacy/security, increase the accessibility of ML applications to different parts of the world by reducing the dependence on the communication fabric and the cloud infrastructure, and meet the real-time requirements of AR/VR applications. However, existing edge platforms do not have sufficient computing capabilities to support complex ML tasks such as training large CNNs. ReRAM-based architectures offer high-performance yet energy efficient computing platforms for on-chip CNN training/inferencing. However, ReRAM-based architectures are not scalable with the size of the CNN. Larger CNNs have more weights, which requires more ReRAM cells that cannot be integrated in a single chip. Moreover, training larger CNNs on-chip will require higher power, which cannot be afforded by these smaller devices. Pruning is an effective way to solve this problem. However, existing pruning techniques are either targeted for inferencing only, or they are not crossbar-aware. This leads to sub-optimal hardware savings and performance benefits for CNN training on ReRAM-based architectures. In this paper, we address this problem by proposing a novel crossbar-aware pruning strategy, referred to as ReaLPrune, which can prune more than 90% of CNN weights. The pruned model can be trained from scratch without any accuracy loss. Experimental results indicate that ReaLPrune reduces hardware requirements by 77.2% and accelerates CNN training by ~20X compared to unpruned CNNs. ReaLPrune also outperforms other crossbar-aware pruning techniques in terms of both performance and hardware savings. In addition, ReaLPrune is equally effective for diverse datasets and more complex CNNs.

preprint 2022 arXiv

Relevance between Information scrambling and quantum Darwinism

A quantum system interacting with its environment can induce redundant encoding of the system's information into a multipartite environment, which is the essence of quantum Darwinism (QD). At the same time, the environment may scramble the initially localized information about the system. We investigate the relevance between information scrambling in the environment and the emergence of quantum Darwinism. First, we identify in general that when the system shows Darwinistic behavior, system information that is initially localized in the environment is not scrambled, while when Darwinism disappears, scrambling occurs. We then verify our result through a collision model where the system, consisting of one or two qubits, interacts with an ensemble of environmental ancillas. Moreover, depending on the nature of the system-environment interactions, our results also show that the single-qubit and two-qubit systems behave differently for the emergence of QD and the scrambling, but the above relevance between them remains valid.

preprint 2021 arXiv

A Case for 3D Integrated System Design for Neuromorphic Computing & AI Applications

Over the last decade, artificial intelligence has found many application areas in society. As AI solutions have become more sophisticated and the use cases have grown, they have highlighted the need to address performance and energy efficiency challenges faced during the implementation process. To address these challenges, there has been growing interest in neuromorphic chips. Neuromorphic computing relies on non-von Neumann architectures as well as novel devices, circuits and manufacturing technologies to mimic the human brain. Among such technologies, 3D integration is an important enabler for AI hardware and the continuation of the scaling laws. In this paper, we overview the unique opportunities 3D integration provides in neuromorphic chip design, discuss the emerging opportunities in next generation neuromorphic architectures and review the obstacles. Neuromorphic architectures, which rely on the brain for inspiration and emulation purposes, face grand challenges due to the limited understanding of the functionality and the architecture of the human brain. Yet, high levels of investment are dedicated to developing neuromorphic chips. We argue that 3D integration not only provides strategic advantages to the cost-effective and flexible design of neuromorphic chips, but may also provide design flexibility in incorporating advanced capabilities to further benefit the designs in the future.

preprint 2021 arXiv

Anisotropic Andreev reflection in semi-Dirac materials

In the framework of the Bogoliubov-de Gennes equation, we theoretically study the Andreev reflection in normal-superconducting junctions based on semi-Dirac materials. Owing to the intrinsic anisotropy of semi-Dirac materials, the configuration of Andreev reflection and the differential conductance are strongly orientation-dependent. For transport along the linear dispersion direction, the differential conductance exhibits a clear crossover from retro Andreev reflection to specular Andreev reflection with increasing bias voltage, and the differential conductance oscillates without a decaying profile when the interfacial barrier strength increases. However, for transport along the quadratic dispersion direction, the boundary between the retro Andreev reflection and specular Andreev reflection is ambiguous, and the differential conductance decays with increasing momentum mismatch or interfacial barrier strength. We illustrate the pseudo-spin textures to reveal the underlying physics behind the anisotropic coherent transport properties. These results enrich the understanding of the superconducting coherent transport in semi-Dirac materials.

preprint 2021 arXiv

BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization

Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from a new angle of inducing bit-level sparsity. We consider each bit of quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme of the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.
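
To make the link between all-zero bits and precision reduction concrete, the following minimal sketch inspects the bit planes of a group of quantized weights. It only illustrates the end effect, not the BSQ training procedure or its regularizer. The assumptions are that weights are stored as unsigned integer magnitudes, that leading and trailing all-zero planes can be dropped (the latter by folding a power of two into the scale), and the helper names `active_bit_planes` and `effective_precision` are hypothetical.

```python
def active_bit_planes(group, total_bits=8):
    """Return the bit positions that are set in at least one weight of the group."""
    for w in group:
        if not 0 <= w < (1 << total_bits):
            raise ValueError(f"weight {w} does not fit in {total_bits} unsigned bits")
    return [b for b in range(total_bits) if any((w >> b) & 1 for w in group)]

def effective_precision(group, total_bits=8):
    """Bits needed once leading and trailing all-zero bit planes are removed."""
    planes = active_bit_planes(group, total_bits)
    if not planes:
        return 0  # the whole group is zero and can be pruned
    return planes[-1] - planes[0] + 1

# Binary: 00100, 01100, 01000, 10100 -> planes 0 and 1 and planes 5..7 are all zero.
group = [4, 12, 8, 20]
print(active_bit_planes(group))    # [2, 3, 4]
print(effective_precision(group))  # 3
```

Different groups (for example, different layers) can end up with different effective precisions, which is what yields a mixed-precision scheme.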

preprint 2021 arXiv

Improving Adversarial Robustness in Weight-quantized Neural Networks

Neural networks are getting deeper and more computation-intensive. Quantization is a useful technique for deploying neural networks on hardware platforms and saving computation costs with negligible performance loss. However, recent research reveals that neural network models, whether full-precision or quantized, are vulnerable to adversarial attacks. In this work, we analyze both adversarial and quantization losses and then introduce criteria to evaluate them. We propose a boundary-based retraining method to mitigate adversarial and quantization losses together and adopt a nonlinear mapping method to defend against white-box gradient-based adversarial attacks. The evaluations demonstrate that our method can better restore accuracy after quantization than other baseline methods on both black-box and white-box adversarial attacks. The results also show that adversarial training suffers quantization loss and does not cooperate well with other training methods.

preprint 2021 arXiv

On Provable Backdoor Defense in Collaborative Learning

As collaborative learning allows joint training of a model using multiple sources of data, the security problem has been a central concern. Malicious users can upload poisoned data to prevent the model's convergence or inject hidden backdoors. The so-called backdoor attacks are especially difficult to detect since the model behaves normally on standard test data but gives wrong outputs when triggered by certain backdoor keys. Although Byzantine-tolerant training algorithms provide convergence guarantees, provable defense against backdoor attacks remains largely unsolved. Methods based on randomized smoothing can only correct a small number of corrupted pixels or labels; methods based on subset aggregation cause a severe drop in classification accuracy due to low data utilization. We propose a novel framework that generalizes existing subset aggregation methods. The framework shows that the subset selection process, a deciding factor for subset aggregation methods, can be viewed as a code design problem. We derive the theoretical bound of the data utilization ratio and provide an optimal code construction. Experiments on non-IID versions of MNIST and CIFAR-10 show that our method with optimal codes significantly outperforms baselines using non-overlapping partition and random selection. Additionally, integration with existing coding theory results shows that special codes can track the location of the attackers. Such capability provides new countermeasures to backdoor attacks.

preprint 2021 arXiv

Physics-Based Models for Magneto-Electric Spin-Orbit Logic Circuits

Spintronic devices are a promising beyond-CMOS device option thanks to their energy efficiency and compatibility with CMOS. To accurately capture their multi-physics dynamics, a rigorous treatment of both spin and charge and their inter-conversion is required. Here we present physics-based device models based on 4x4 matrices for the spin-orbit coupling part of the magneto-electric spin-orbit (MESO) device. Also, a more rigorous physics model of ferroelectric and magnetoelectric switching of ferromagnets, based on Landau-Lifshitz-Gilbert (LLG) and Landau-Khalatnikov (LK) equations, is presented. With the combined model implemented in a SPICE circuit simulator environment, simulation results were obtained which show feasibility of MESO implementation and functional operation of buffers, oscillators, and majority gates.

preprint 2020 arXiv

AutoGrow: Automatic Layer Growing in Deep Convolutional Networks

Depth is a key component of Deep Neural Networks (DNNs); however, designing depth is heuristic and requires much human effort. We propose AutoGrow to automate depth discovery in DNNs: starting from a shallow seed architecture, AutoGrow grows new layers if the growth improves the accuracy; otherwise, it stops growing and thus discovers the depth. We propose robust growing and stopping policies to generalize to different network architectures and datasets. Our experiments show that by applying the same policy to different network architectures, AutoGrow can always discover near-optimal depth on various datasets of MNIST, FashionMNIST, SVHN, CIFAR10, CIFAR100 and ImageNet. For example, in terms of accuracy-computation trade-off, AutoGrow discovers a better depth combination in ResNets than human experts. AutoGrow is efficient: it discovers depth in a time similar to that of training a single DNN. Our code is available at https://github.com/wenwei202/autogrow.
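
The grow-until-no-improvement loop described in this abstract can be sketched in a few lines. This is a simplified illustration, not the released AutoGrow code: `grow_depth`, its `min_gain` stopping rule, and the `toy_accuracy` function (a stand-in for "train and validate a network of this depth") are assumptions introduced here.

```python
def grow_depth(evaluate, seed_depth=1, max_depth=50, min_gain=0.0):
    """Add one layer at a time while accuracy improves by more than min_gain.

    evaluate(depth) is assumed to return validation accuracy for that depth.
    Returns the discovered depth and its accuracy.
    """
    depth = seed_depth
    best_acc = evaluate(depth)
    while depth < max_depth:
        candidate_acc = evaluate(depth + 1)
        if candidate_acc - best_acc <= min_gain:
            break  # growth no longer helps: stop and keep the current depth
        depth += 1
        best_acc = candidate_acc
    return depth, best_acc

def toy_accuracy(depth):
    # Hypothetical accuracy curve that peaks at depth 6.
    return 0.9 - 0.01 * (depth - 6) ** 2

depth, acc = grow_depth(toy_accuracy)
print(depth)  # 6
```

In practice each call to `evaluate` would involve training, so the stopping policy determines how much compute is spent; the abstract's robust growing and stopping policies are not reproduced here.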

preprint 2020 arXiv

DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures

In seeking sparse and efficient neural network models, many previous works investigated enforcing L1 or L0 regularizers to encourage weight sparsity during training. The L0 regularizer measures the parameter sparsity directly and is invariant to the scaling of parameter values, but it cannot provide useful gradients, and therefore requires complex optimization techniques. The L1 regularizer is almost everywhere differentiable and can be easily optimized with gradient descent. Yet it is not scale-invariant, causing the same shrinking rate to all parameters, which is inefficient in increasing sparsity. Inspired by the Hoyer measure (the ratio between L1 and L2 norms) used in traditional compressed sensing problems, we present DeepHoyer, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant. Our experiments show that enforcing DeepHoyer regularizers can produce even sparser neural network models than previous works, under the same accuracy level. We also show that DeepHoyer can be applied to both element-wise and structural pruning.
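
The scale-invariance argument can be checked numerically. The sketch below computes the plain Hoyer ratio (L1 norm divided by L2 norm) named in the abstract; it is not the exact regularizer used in DeepHoyer, and the function names and example vectors are assumptions chosen for illustration.

```python
import math

def l1_norm(weights):
    """Sum of absolute values."""
    return sum(abs(w) for w in weights)

def hoyer_ratio(weights):
    """L1 norm divided by L2 norm; smaller values indicate a sparser vector."""
    l2 = math.sqrt(sum(w * w for w in weights))
    if l2 == 0.0:
        raise ValueError("the ratio is undefined for the all-zero vector")
    return l1_norm(weights) / l2

dense = [1.0, 1.0, 1.0, 1.0]
sparse = [2.0, 0.0, 0.0, 0.0]
scaled = [10.0 * w for w in dense]

print(hoyer_ratio(dense), hoyer_ratio(sparse))  # 2.0 1.0: sparser vector scores lower
print(l1_norm(dense), l1_norm(scaled))          # 4.0 40.0: L1 changes with scale
print(hoyer_ratio(dense), hoyer_ratio(scaled))  # 2.0 2.0: the ratio does not
```

For a vector of length n the ratio ranges from 1 (a single non-zero entry) to the square root of n (all entries of equal magnitude), so minimizing it pushes toward sparsity regardless of the overall weight scale.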

preprint 2020 arXiv

Defending against GAN-based Deepfake Attacks via Transformation-aware Adversarial Faces

Deepfake represents a category of face-swapping attacks that leverage machine learning models such as autoencoders or generative adversarial networks. Although the concept of face-swapping is not new, its recent technical advances make fake content (e.g., images, videos) more realistic and imperceptible to humans. Various detection techniques for Deepfake attacks have been explored. These methods, however, are passive measures against Deepfakes as they are mitigation strategies after the high-quality fake content is generated. More importantly, we would like to think ahead of the attackers with robust defenses. This work aims to take an offensive measure to impede the generation of high-quality fake images or videos. Specifically, we propose to use novel transformation-aware adversarially perturbed faces as a defense against GAN-based Deepfake attacks. Different from naive adversarial faces, our proposed approach leverages differentiable random image transformations during the generation. We also propose to use an ensemble-based approach to enhance the defense robustness against GAN-based Deepfake variants under the black-box setting. We show that training a Deepfake model with adversarial faces can lead to a significant degradation in the quality of synthesized faces. This degradation is twofold. On the one hand, the quality of the synthesized faces is reduced with more visual artifacts such that the synthesized faces are more obviously fake or less convincing to human observers. On the other hand, the synthesized faces can easily be detected based on various metrics.

preprint 2020 arXiv

Gridless Super-Resolution Sparse Recovery for Non-sidelooking STAP using Reweighted Atomic Norm Minimization

Sparse recovery Space-time Adaptive Processing (STAP) can reduce the requirements of clutter samples, and suppress clutter effectively using limited training samples for airborne radar. In presently available sparse recovery STAP methods, the whole angle-Doppler plane is uniformly discretized into small grid points; however, the clutter ridge is not located exactly on the pre-discretized grid points in non-sidelooking STAP radar. This off-grid effect degrades the performance of STAP significantly. In this paper, a gridless sparse recovery STAP method is proposed based on reweighted atomic norm minimization, in which the clutter spectrum is precisely estimated in the continuous angle-Doppler plane without resolution limit. Numerical results show that the proposed method improves on both the sparse recovery STAP methods with discretized dictionaries and the STAP method utilizing atomic norm minimization.

preprint 2020 arXiv

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array

With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to explore the coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. This poses the key research question of finding the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution, HyPar, to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search for a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communications. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.
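
A layer-wise dynamic program of the kind mentioned in this abstract can be sketched as follows. This is a flat, single-level simplification rather than HyPar's hierarchical search, and everything concrete in it is an assumption: the two choices labeled `"dp"` and `"mp"` (read as data and model parallelism), the per-layer costs in `intra_cost`, and the switching costs in `inter_cost` are invented numbers, not outputs of the paper's communication model.

```python
def min_communication(intra_cost, inter_cost):
    """Pick one parallelism choice per layer to minimize total communication.

    intra_cost: list of {choice: cost} dicts, one per layer.
    inter_cost: {(previous_choice, current_choice): cost} between adjacent layers.
    Returns (total_cost, list_of_choices).
    """
    choices = list(intra_cost[0])
    # best[c] = cheapest (cost, path) for the layers so far, ending in choice c
    best = {c: (intra_cost[0][c], [c]) for c in choices}
    for layer in intra_cost[1:]:
        new_best = {}
        for cur in choices:
            cost, path = min(
                ((best[prev][0] + inter_cost[(prev, cur)], best[prev][1])
                 for prev in choices),
                key=lambda t: t[0],
            )
            new_best[cur] = (cost + layer[cur], path + [cur])
        best = new_best
    return min(best.values(), key=lambda t: t[0])

# Hypothetical costs for a three-layer network (two conv-like layers, one fc-like layer).
intra_cost = [{"dp": 1, "mp": 8}, {"dp": 2, "mp": 6}, {"dp": 9, "mp": 1}]
inter_cost = {("dp", "dp"): 0, ("mp", "mp"): 0, ("dp", "mp"): 3, ("mp", "dp"): 3}

result = min_communication(intra_cost, inter_cost)
print(result)  # (7, ['dp', 'dp', 'mp'])
```

The table `best` grows by one layer per iteration, so the search is linear in the number of layers instead of exponential in it.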

preprint 2020 arXiv

Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification

Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the multiplication of two or more low-rank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce low rank through implicit approximations or via a costly singular value decomposition (SVD) process on every training step. The former approach usually induces a high accuracy loss while the latter has low efficiency. In this work, we propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its full-rank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensures the valid form of SVD and avoids gradient vanishing/exploding. Low rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a low-rank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a higher reduction in computation load under the same accuracy, compared to not only previous factorization methods but also state-of-the-art filter pruning methods.
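
One plausible way to write down the training objective described in this abstract is given below. The abstract does not state the exact formula, so the notation is an assumption: $W_l$ is the weight matrix of layer $l$, $\mathcal{L}_{\text{task}}$ the task loss, $\lambda_o$ and $\lambda_s$ hypothetical regularization strengths, and $R$ a generic sparsity-inducing regularizer (the L1 norm is one example).

```latex
% Each layer is kept in decomposed form and trained directly on U_l, s_l, V_l:
W_l = U_l \,\operatorname{diag}(s_l)\, V_l^{\top}

% Assumed overall objective: task loss, orthogonality penalty on the
% singular vectors, and a sparsity penalty on the singular values.
\mathcal{L}
  = \mathcal{L}_{\text{task}}
  + \lambda_o \sum_{l} \left(
        \lVert U_l^{\top} U_l - I \rVert_F^2
      + \lVert V_l^{\top} V_l - I \rVert_F^2
    \right)
  + \lambda_s \sum_{l} R(s_l)
```

After training, entries of $s_l$ that have been driven close to zero are pruned together with the matching columns of $U_l$ and $V_l$, which is what lowers the rank of $W_l$.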

preprint 2020 arXiv

LotteryFL: Personalized and Communication-Efficient Federated Learning with Lottery Ticket Hypothesis on Non-IID Datasets

Federated learning is a popular distributed machine learning paradigm with enhanced privacy. Its primary goal is learning a global model that offers good performance for as many participants as possible. The technology is rapidly advancing with many unsolved challenges, among which statistical heterogeneity (i.e., non-IID) and communication efficiency are two critical ones that hinder the development of federated learning. In this work, we propose LotteryFL -- a personalized and communication-efficient federated learning framework that exploits the Lottery Ticket hypothesis. In LotteryFL, each client learns a lottery ticket network (i.e., a subnetwork of the base model) by applying the Lottery Ticket hypothesis, and only these lottery networks will be communicated between the server and clients. Rather than learning a shared global model as in classic federated learning, each client learns a personalized model via LotteryFL; the communication cost can be significantly reduced due to the compact size of lottery networks. To support the training and evaluation of our framework, we construct non-IID datasets based on MNIST, CIFAR-10 and EMNIST by taking feature distribution skew, label distribution skew and quantity skew into consideration. Experiments on these non-IID datasets demonstrate that LotteryFL significantly outperforms existing solutions in terms of personalization and communication cost.

preprint2020arXiv

PENNI: Pruned Kernel Sharing for Efficient CNN Inference

Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make it difficult to deploy these SOTA CNNs onto resource-constrained devices. Previous works on CNN acceleration utilize low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are very difficult to apply to sparse models, which limits execution speedup since redundancies within the CNN model are not fully exploited. We argue that kernel granularity decomposition can be conducted with a low-rank assumption while exploiting the redundancy within the remaining compact coefficients. Based on this observation, we propose PENNI, a CNN model compression framework that is able to achieve model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. Experiments show that we can prune 97% of parameters and 92% of FLOPs on ResNet18 CIFAR10 with no accuracy loss, and achieve a 44% reduction in run-time memory consumption and a 53% reduction in inference latency.
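Kernel sharing can be pictured as writing every 2D kernel in a layer as a linear combination of a few shared basis kernels. The sketch below shows that reconstruction and a rough parameter count. It assumes one basis set per layer and dense coefficients (before any sparsity constraint is applied); the function names and the example sizes are hypothetical.

```python
from typing import List


def reconstruct_kernel(coeffs: List[float], basis: List[List[float]]) -> List[float]:
    """Rebuild one flattened k*k kernel as sum_i coeffs[i] * basis[i]."""
    size = len(basis[0])
    return [sum(c * b[p] for c, b in zip(coeffs, basis)) for p in range(size)]


def dense_params(c_out: int, c_in: int, k: int) -> int:
    """Parameter count of an ordinary convolution layer (bias ignored)."""
    return c_out * c_in * k * k


def shared_params(c_out: int, c_in: int, k: int, num_basis: int) -> int:
    """Basis kernels plus one coefficient per (kernel, basis) pair, before sparsification."""
    return num_basis * k * k + c_out * c_in * num_basis


kernel = reconstruct_kernel([2.0, 0.0], [[1.0, 0.0, 1.0], [5.0, 5.0, 5.0]])
before = dense_params(64, 64, 3)
after = shared_params(64, 64, 3, num_basis=5)
```

In this toy count the saving comes from replacing the nine weights of each 3x3 kernel with five coefficients; sparse constraints on the coefficients would reduce the count further.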

preprint2020arXiv

Reinforcement Learning-based Black-Box Evasion Attacks to Link Prediction in Dynamic Graphs

Link prediction in dynamic graphs (LPDG) is an important research problem that has diverse applications such as online recommendations, studies on disease contagion, organizational studies, etc. Various LPDG methods based on graph embedding and graph neural networks have recently been proposed and have achieved state-of-the-art performance. In this paper, we study the vulnerability of LPDG methods and propose the first practical black-box evasion attack. Specifically, given a trained LPDG model, our attack aims to perturb the graph structure, without knowing the model parameters, model architecture, etc., such that the LPDG model predicts as many wrong links as possible. We design our attack based on a stochastic policy-based RL algorithm. Moreover, we evaluate our attack on three real-world graph datasets from different application domains. Experimental results show that our attack is both effective and efficient.

preprint2020arXiv

Snooping Attacks on Deep Reinforcement Learning

Adversarial attacks have exposed a significant security vulnerability in state-of-the-art machine learning models. These models include deep reinforcement learning agents. The existing methods for attacking reinforcement learning agents assume the adversary either has access to the target agent's learned parameters or the environment that the agent interacts with. In this work, we propose a new class of threat models, called snooping threat models, that are unique to reinforcement learning. In these snooping threat models, the adversary does not have the ability to interact with the target agent's environment, and can only eavesdrop on the action and reward signals being exchanged between agent and environment. We show that adversaries operating in these highly constrained threat models can still launch devastating attacks against the target agent by training proxy models on related tasks and leveraging the transferability of adversarial examples.

preprint2020arXiv

Structural sparsification for Far-field Speaker Recognition with GNA

Recently, deep neural networks (DNNs) have been widely used in the speaker recognition area. To achieve fast response time and high accuracy, the requirements for hardware resources increase rapidly. However, as speaker recognition applications are often implemented on mobile devices, it is necessary to maintain a low computational cost while keeping high accuracy in far-field conditions. In this paper, we apply structural sparsification on time-delay neural networks (TDNNs) to remove redundant structures and accelerate the execution. On our targeted hardware, our model can remove 60% of parameters while only slightly increasing the equal error rate (EER) by 0.18%, and our structurally sparse model can achieve more than 1.5x speedup.

preprint2019arXiv

AutoShrink: A Topology-aware NAS for Discovering Efficient Neural Architecture

Resources are an important constraint when deploying Deep Neural Networks (DNNs) on mobile and edge devices. Existing works commonly adopt the cell-based search approach, which limits the flexibility of network patterns in learned cell structures. Moreover, due to the topology-agnostic nature of existing works, including both cell-based and node-based approaches, the search process is time consuming and the performance of the found architecture may be sub-optimal. To address these problems, we propose AutoShrink, a topology-aware Neural Architecture Search (NAS) for searching efficient building blocks of neural architectures. Our method is node-based and thus can learn flexible network patterns in cell structures within a topological search space. Directed Acyclic Graphs (DAGs) are used to abstract DNN architectures and progressively optimize the cell structure through edge shrinking. As the search space intrinsically reduces as the edges are progressively shrunk, AutoShrink explores a more flexible search space with even less search time. We evaluate AutoShrink on image classification and language tasks by crafting ShrinkCNN and ShrinkRNN models. ShrinkCNN is able to achieve up to 48% parameter reduction and save 34% Multiply-Accumulates (MACs) on ImageNet-1K with accuracy comparable to state-of-the-art (SOTA) models. Specifically, both ShrinkCNN and ShrinkRNN are crafted within 1.5 GPU hours, which is 7.2x and 6.7x faster than the crafting time of SOTA CNN and RNN models, respectively.

preprint2016arXiv

A New Learning Method for Inference Accuracy, Core Occupation, and Performance Co-optimization on TrueNorth Chip

The IBM TrueNorth chip uses digital spikes to perform neuromorphic computing and achieves ultrahigh execution parallelism and power efficiency. However, in the TrueNorth chip, the low quantization resolution of the synaptic weights and spikes significantly limits the inference (e.g., classification) accuracy of the deployed neural network model. The existing workaround, i.e., averaging the results over multiple copies instantiated in spatial and temporal domains, rapidly exhausts the hardware resources and slows down the computation. In this work, we propose a novel learning method on the TrueNorth platform that constrains the random variance of each computation copy and reduces the number of needed copies. Compared to the existing learning method, our method can achieve up to 68.8% reduction of the required neuro-synaptic cores or 6.5X speedup, with even slightly improved inference accuracy.

preprint2016arXiv

Learning Structured Sparsity in Deep Neural Networks

High demand for computation resources severely hinders deployment of large-scale Deep Neural Networks (DNNs) on resource-constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of the DNN to efficiently accelerate its evaluation. Experimental results show that SSL achieves on average 5.1x and 3.1x speedups of convolutional layer computation of AlexNet on CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice those of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce 20 layers of a Deep Residual Network (ResNet) to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by around 1%. Open source code is available at https://github.com/wenwei202/caffe/tree/scnn
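Structured sparsity is commonly obtained with a group-Lasso-style penalty, where each group is one structure (for example, all weights of one filter). The abstract does not spell out the regularizer, so the sketch below is an assumption-labeled illustration of that general technique, with hypothetical function names and a flattened weight layout.

```python
import math
from typing import List


def group_lasso(groups: List[List[float]]) -> float:
    """Sum of the L2 norms of each group; pushes whole groups toward exactly zero."""
    return sum(math.sqrt(sum(w * w for w in group)) for group in groups)


def zero_groups(groups: List[List[float]], tol: float = 1e-8) -> List[int]:
    """Indices of groups (e.g. filters) whose norm is below tol and could be removed."""
    return [i for i, group in enumerate(groups)
            if math.sqrt(sum(w * w for w in group)) < tol]


# Two hypothetical filters: one active, one already driven to zero.
filters = [[3.0, 4.0], [0.0, 0.0]]
penalty = group_lasso(filters)
removable = zero_groups(filters)
```

Because the penalty acts on the norm of a whole group, a removed group corresponds to a removed filter or channel, which dense libraries can exploit directly, unlike scattered zero weights.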

preprint2014arXiv

A Universal, Rapid Method for Clean Transfer of Nanostructures onto Various Substrates

Transfer and integration of nanostructures onto target substrates is the prerequisite for their fundamental studies and practical applications. Conventional transfer techniques that involve stamping, lift-off and/or striping suffer from the process-specific drawbacks, such as the requirement for chemical etchant or high-temperature annealing and the introduction of surface discontinuities and/or contaminations that can greatly hinder the properties and functions of the transferred materials. Herein, we report a universal and rapid transfer method implementable at mild conditions. Nanostructures with various dimensionalities (i.e. nanoparticles, nanowires and nanosheets) and surface properties (i.e. hydrophilic and hydrophobic) can be facilely transferred to diverse substrates including hydrophilic, hydrophobic and flexible surfaces with good fidelity. Importantly, our method ensures the rapid and clean transfer of two-dimensional (2D) materials, and allows for the facile fabrication of vertical heterostructures with various compositions used for electronic devices. We believe that our method can facilitate the development of nano-electronics by accelerating the clean transfer and integration of low-dimensional materials into multidimensional structures.

preprint2014arXiv

Quantum coherence rather than quantum correlations reflect the effects of reservoir on the system's work capability

We consider a model of an optical cavity with a nonequilibrium reservoir consisting of a beam of identical two-level atom pairs (TLAPs) in the general X-state. We find that the coherence of the multiparticle nonequilibrium reservoir plays a central role in the potential work capability of the cavity. We show that, whether or not there are quantum correlations in each TLAP (including quantum entanglement and quantum discord), the coherence of the TLAPs has an effect on the work capability of the cavity. Additionally, constructive and destructive interference can be induced to influence the work capability of the cavity solely by adjusting the relative phase, with which quantum correlations have nothing to do. In this paper, we clearly demonstrate that the coherence of the reservoir, rather than the quantum correlations, effectively reflects the effects of the reservoir on the system's work capability.

preprint2013arXiv

Interlayer breathing and shear modes in few-trilayer MoS2 and WSe2

Two-dimensional (2D) layered transition metal dichalcogenides (TMDs) have recently attracted tremendous interest as potential valleytronic and nano-electronic materials, in addition to being well-known as excellent lubricants in the bulk. The interlayer van der Waals (vdW) coupling and low frequency phonon modes, and how they evolve with the number of layers, are important for both the mechanical and electrical properties of 2D TMDs. Here we uncover the ultra-low frequency interlayer breathing and shear modes in few-layer MoS2 and WSe2, prototypical layered TMDs, using both Raman spectroscopy and first principles calculations. Remarkably, the frequencies of these modes can be perfectly described using a simple linear chain model with only nearest-neighbour interactions. We show that the derived in-plane (shear) and out-of-plane (breathing) force constants from experiment remain the same from two-layer 2D crystals to the bulk materials, suggesting that the nanoscale interlayer frictional characteristics of these excellent lubricants should be independent of the number of layers.
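The linear chain model mentioned above has a standard closed form: for N identical layers with free ends, coupled only to nearest neighbours, the mode frequencies are ω_j = 2·sqrt(K/m)·sin(jπ/(2N)) for j = 0, …, N−1. The sketch below evaluates this textbook expression. The symbols K (interlayer force constant per unit area) and m (layer mass per unit area), the function name, and the dimensionless example values are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List


def chain_frequencies(n_layers: int, force_constant: float, mass: float) -> List[float]:
    """Angular frequencies of a free-ended chain of n_layers masses with nearest-neighbour springs."""
    scale = 2.0 * math.sqrt(force_constant / mass)
    return [scale * math.sin(j * math.pi / (2 * n_layers)) for j in range(n_layers)]


# Dimensionless example (K = m = 1): a bilayer has one zero-frequency
# translation mode and one interlayer mode at sqrt(2).
bilayer = chain_frequencies(2, 1.0, 1.0)

# The highest mode approaches the bulk value 2*sqrt(K/m) as the layer count grows.
thick = max(chain_frequencies(200, 1.0, 1.0))
```

The same expression applies to shear and breathing modes, with the in-plane or out-of-plane force constant substituted for K.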

preprint2013arXiv

Investigation of MoS2 and Graphene Nanosheets by Magnetic Force Microscopy

For the first time, the magnetic force microscopy (MFM) is used to characterize the mechanically-exfoliated single- and few-layer MoS2 and graphene nanosheets. By analysis of the phase and amplitude shifts, the magnetic response of MoS2 and graphene nanosheets exhibits the dependence on their layer number. However, the solution-processed single-layer MoS2 nanosheet shows the reverse magnetic signal to the mechanically-exfoliated one, and the graphene oxide nanosheet has not shown any detectable magnetic signal. Importantly, graphene and MoS2 flakes become nonmagnetic when they exceed a certain thickness.

preprint2013arXiv

Rapid and reliable thickness identification of two-dimensional nanosheets using optical microscopy

The physical and electronic properties of ultrathin two-dimensional (2D) layered nanomaterials are highly related to their thickness. Therefore, the rapid and accurate identification of single- and few- to multi-layer nanosheets is essential to their fundamental study and practical applications. Here, a universal optical method has been developed for simple, rapid and reliable identification of single- to quindecuple-layer (1L-15L) 2D nanosheets, including graphene, MoS2, WSe2 and TaS2, on Si substrates coated with 90 nm or 300 nm SiO2. The optical contrast differences between the substrates and 2D nanosheets with different layer numbers were collected and tabulated, serving as a standard reference, from which the layer number of a given nanosheet can be readily and reliably determined without complex calculations or expensive instruments. Our general optical identification method will facilitate the thickness-dependent study of various 2D nanomaterials, and expedite their research toward practical applications.

preprint2013arXiv

Single-Layer MoS2 Phototransistors

A new phototransistor based on the mechanically-exfoliated single-layer MoS2 nanosheet is fabricated and its light-induced electric properties are investigated in detail. Photocurrent generated from the phototransistor is solely determined by the illuminated optical power at a constant drain or gate voltage. The switching behavior of photocurrent generation and annihilation can be completely finished within ca. 50 ms and it shows good stability. In particular, the single-layer MoS2 phototransistor exhibits better photoresponsivity than the graphene-based device. The unique characteristics of incident-light control, prompt photoswitching and good photoresponsivity of the MoS2 phototransistor pave the way to develop single-layer semiconducting materials for multi-functional optoelectronic device applications in the future.

preprint2012arXiv

Growth of Large-Area and Highly Crystalline MoS2 Thin Layers on Insulating Substrates

The two-dimensional layer of molybdenum disulfide (MoS2) has recently attracted much interest due to its direct-gap property and potential applications in optoelectronics and energy harvesting. However, the synthetic approach to obtain high quality and large-area MoS2 atomic thin layers is still rare. Here we report that the high temperature annealing of a thermally decomposed ammonium thiomolybdate layer in the presence of sulfur can produce large-area MoS2 thin layers with superior electrical performance on insulating substrates. Spectroscopic and microscopic results reveal that the synthesized MoS2 sheets are highly crystalline. The electron mobility of the bottom-gate transistor devices made of the synthesized MoS2 layer is comparable with those of the micromechanically exfoliated thin sheets from MoS2 crystals. This synthetic approach is simple, scalable and applicable to other transition metal dichalcogenides. Meanwhile, the obtained MoS2 films are transferable to arbitrary substrates, providing great opportunities to make layered composites by stacking various atomically thin layers.

preprint2012arXiv

Revisiting the quantum Szilard engine with fully quantum considerations

By considering level shifting during the insertion process, we revisit the quantum Szilard engine (QSZE) with fully quantum considerations. We derive the general expressions of the heat absorbed from the thermal bath and the total work done on the environment by the system in a cycle with two different cyclic strategies. We find that only the quantum information contributes to the absorbed heat, and the classical information acts like a feedback controller and has no direct effect on the absorbed heat. This is the first demonstration of the different effects of quantum information and classical information for extracting heat from the bath in the QSZE. Moreover, when the well width $L\rightarrow \infty$ or the temperature of the bath $T\rightarrow \infty$, the QSZE reduces to the classical Szilard engine (CSZE), and the total work satisfies the relation $W_{\mathrm{tot}}=k_{B}T \ln 2$, as obtained by Sang Wook Kim et al. [Phys. Rev. Lett. 106, 070401 (2011)] for the one-particle case.
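For a sense of scale, the classical limit $W_{\mathrm{tot}}=k_{B}T \ln 2$ can be evaluated numerically. The short sketch below does this at an example temperature of 300 K, which is an arbitrary choice for illustration and does not come from the paper; the function name is likewise hypothetical.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact value in the 2019 SI definition


def classical_szilard_work(temperature_kelvin: float) -> float:
    """Work per cycle of the classical one-particle Szilard engine, k_B * T * ln 2, in joules."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2.0)


work_300k = classical_szilard_work(300.0)  # about 2.87e-21 J
```

The result scales linearly with temperature, so doubling the bath temperature doubles the extractable work in this limit.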

39 published item(s)

Evaluating Accounting Reasoning Capabilities of Large Language Models

HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble

FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning

Field free switching through bulk spin-orbit torque in L10-FePt films deposited on vicinal substrates

Fine-grain Inference on Out-of-Distribution Data with Hierarchical Classification

IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion

Joint Optimization of STAR-RIS Assisted UAV Communication Systems

Learning and Compositionality: a Unification Attempt via Connectionist Probabilistic Programming

Nonvolatile Electric Field Control of Thermal Magnons in the Absence of an Applied Magnetic Field

ReaLPrune: ReRAM Crossbar-aware Lottery Ticket Pruned CNNs

Relevance between Information scrambling and quantum Darwinism

A Case for 3D Integrated System Design for Neuromorphic Computing & AI Applications

Anisotropic Andreev reflection in semi-Dirac materials

BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization

Improving Adversarial Robustness in Weight-quantized Neural Networks

On Provable Backdoor Defense in Collaborative Learning

Physics-Based Models for Magneto-Electric Spin-Orbit Logic Circuits

AutoGrow: Automatic Layer Growing in Deep Convolutional Networks

DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures

Defending against GAN-based Deepfake Attacks via Transformation-aware Adversarial Faces

Gridless Super-Resolution Sparse Recovery for Non-sidelooking STAP using Reweighted Atomic Norm Minimization

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array

Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification

LotteryFL: Personalized and Communication-Efficient Federated Learning with Lottery Ticket Hypothesis on Non-IID Datasets

PENNI: Pruned Kernel Sharing for Efficient CNN Inference

Reinforcement Learning-based Black-Box Evasion Attacks to Link Prediction in Dynamic Graphs

Snooping Attacks on Deep Reinforcement Learning

Structural sparsification for Far-field Speaker Recognition with GNA

AutoShrink: A Topology-aware NAS for Discovering Efficient Neural Architecture

A New Learning Method for Inference Accuracy, Core Occupation, and Performance Co-optimization on TrueNorth Chip

Learning Structured Sparsity in Deep Neural Networks

A Universal, Rapid Method for Clean Transfer of Nanostructures onto Various Substrates

Quantum coherence rather than quantum correlations reflect the effects of reservoir on the system's work capability

Interlayer breathing and shear modes in few-trilayer MoS2 and WSe2

Investigation of MoS2 and Graphene Nanosheets by Magnetic Force Microscopy

Rapid and reliable thickness identification of two-dimensional nanosheets using optical microscopy

Single-Layer MoS2 Phototransistors

Growth of Large-Area and Highly Crystalline MoS2 Thin Layers on Insulating Substrates

Revisiting the quantum Szilard engine with fully quantum considerations