Source author record

Guangyu Sun

Guangyu Sun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning, Artificial Intelligence, Hardware Architecture, Computer Vision, cond-mat.mtrl-sci, cond-mat.str-el, Databases, eess.SP

Catalog footprint

What is connected

9 works

8 topics

4 close collaborators


preprint, 2022, arXiv

A Survey of Trustworthy Graph Learning: Reliability, Explainability, and Privacy Protection

Deep graph learning has achieved remarkable progress in both business and scientific areas, ranging from finance and e-commerce to drug and advanced material discovery. Despite this progress, how to ensure that various deep graph learning algorithms behave in a socially responsible manner and meet regulatory compliance requirements has become an emerging problem, especially in risk-sensitive domains. Trustworthy graph learning (TwGL) aims to solve the above problems from a technical viewpoint. In contrast to conventional graph learning research, which mainly cares about model performance, TwGL considers various reliability and safety aspects of the graph learning framework, including but not limited to robustness, explainability, and privacy. In this survey, we provide a comprehensive review of recent leading approaches in the TwGL field from three dimensions, namely, reliability, explainability, and privacy protection. We give a general categorization of existing work and review typical work for each category. To give further insights for TwGL research, we provide a unified view to inspect previous works and build the connections between them. We also point out some important open problems remaining to be solved in the future development of TwGL.

preprint, 2022, arXiv

Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention

In recent years, transformer models have revolutionized Natural Language Processing (NLP) and shown promising performance on Computer Vision (CV) tasks. Despite their effectiveness, transformers' attention operations are hard to accelerate due to complicated data movement and quadratic computational complexity, which prohibits real-time inference on resource-constrained edge-computing platforms. To tackle this challenge, we propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention. With the observation that attention results only depend on a few important query-key pairs, we propose a Mix-Precision Multi-Round Filtering (MP-MRF) algorithm to dynamically identify such pairs at runtime. We adopt low bitwidth in each filtering round and only use high-precision tensors in the attention stage to reduce overall complexity. By this means, we significantly reduce the computational cost with negligible accuracy loss. To enable such an algorithm with lower latency and better energy efficiency, we also propose an Energon co-processor architecture. Elaborated pipelines and specialized optimizations jointly boost the performance and reduce power consumption. Extensive experiments on both NLP and CV benchmarks demonstrate that Energon achieves $168\times$ and $8.7\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction compared with Intel Xeon 5220 CPU and NVIDIA V100 GPU, respectively. Compared to the state-of-the-art attention accelerators SpAtten and $A^3$, Energon also achieves $1.7\times$ and $1.25\times$ speedup and $1.6\times$ and $1.5\times$ higher energy efficiency, respectively.
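The filtering idea in the abstract above can be illustrated with a minimal sketch for a single query. It is not the paper's MP-MRF algorithm: the uniform quantizer, the top-fraction selection rule, the default `rounds` schedule, and every function name here are assumptions made for illustration. Each round scores the surviving keys with low-bit operands and discards the weakest, and only the survivors reach a full-precision softmax.

```python
import math

def quantize(vec, bits, scale):
    """Uniformly quantize a vector to signed integers of the given bit width."""
    levels = 2 ** (bits - 1) - 1
    return [round(x / scale * levels) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filter_round(query, keys, candidates, bits, keep_ratio):
    """Score candidate keys with low-bit operands and keep the top fraction."""
    q_scale = max(abs(x) for x in query) or 1.0
    k_scale = max(abs(x) for key in keys for x in key) or 1.0
    q = quantize(query, bits, q_scale)
    scored = [(dot(q, quantize(keys[i], bits, k_scale)), i) for i in candidates]
    scored.sort(reverse=True)
    keep = max(1, math.ceil(len(scored) * keep_ratio))
    return sorted(i for _, i in scored[:keep])

def sparse_attention(query, keys, values, rounds=((2, 0.5), (4, 0.5))):
    """Filter keys over several low-precision rounds, then attend in full precision.

    `rounds` is a hypothetical schedule of (bit width, fraction kept) pairs.
    """
    candidates = list(range(len(keys)))
    for bits, keep_ratio in rounds:
        candidates = filter_round(query, keys, candidates, bits, keep_ratio)
    # Full-precision scaled dot-product softmax over the surviving keys only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in candidates]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, candidates)) / total
        for d in range(dim)
    ]
    return output, candidates
```

Passing `rounds=()` skips filtering and falls back to dense attention over all keys, which gives a baseline for checking how much accuracy a given schedule costs.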

preprint, 2022, arXiv

GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing

Recently, Graph Neural Networks (GNNs) have become state-of-the-art algorithms for analyzing non-Euclidean graph data. However, realizing efficient GNN training is challenging, especially on large graphs. The reasons are manifold: 1) GNN training incurs a substantial memory footprint. Full-batch training on large graphs even requires hundreds to thousands of gigabytes of memory. 2) GNN training involves both memory-intensive and computation-intensive operations, challenging current CPU/GPU platforms. 3) The irregularity of graphs can result in severe resource under-utilization and load-imbalance problems. This paper presents a GNNear accelerator to tackle these challenges. GNNear adopts a DIMM-based memory system to provide sufficient memory capacity. To match the heterogeneous nature of GNN training, we offload the memory-intensive Reduce operations to in-DIMM Near-Memory-Engines (NMEs), making full use of the high aggregated local bandwidth. We adopt a Centralized-Acceleration-Engine (CAE) to process the computation-intensive Update operations. We further propose several optimization strategies to deal with the irregularity of input graphs and improve GNNear's performance. Comprehensive evaluations on 16 GNN training tasks demonstrate that GNNear achieves 30.8$\times$/2.5$\times$ geomean speedup and 79.6$\times$/7.3$\times$ geomean higher energy efficiency compared to Xeon E5-2698-v4 CPU and NVIDIA V100 GPU.
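The split between Reduce and Update that the abstract above relies on can be shown with a small software sketch of one GNN layer. The abstract names the two operations but not their exact form, so sum aggregation, the ReLU activation, and all function names below are assumptions. The point is only that Reduce is an irregular gather over neighbor features, while Update is a dense matrix product.

```python
def reduce_neighbors(features, adjacency):
    """Reduce: sum the feature vectors of each vertex's neighbors (irregular gather)."""
    dim = len(features[0])
    aggregated = []
    for neighbors in adjacency:
        acc = [0.0] * dim
        for u in neighbors:
            for d in range(dim):
                acc[d] += features[u][d]
        aggregated.append(acc)
    return aggregated

def update(aggregated, weight):
    """Update: dense matrix multiply followed by ReLU (regular, compute-heavy)."""
    cols = len(weight[0])
    return [
        [max(0.0, sum(row[k] * weight[k][j] for k in range(len(row)))) for j in range(cols)]
        for row in aggregated
    ]

def gnn_layer(features, adjacency, weight):
    """One layer: Reduce over neighbors, then Update with a shared weight matrix."""
    return update(reduce_neighbors(features, adjacency), weight)
```

In this sketch `adjacency[v]` lists the neighbors of vertex `v`. The memory traffic of `reduce_neighbors` depends on the graph structure, which is why it is the part a near-memory engine would target.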

preprint, 2022, arXiv

GNNSampler: Bridging the Gap between Sampling Algorithms of GNN and Hardware

Sampling is a critical operation in Graph Neural Network (GNN) training that helps reduce the cost. Previous literature has explored improving sampling algorithms via mathematical and statistical methods. However, there is a gap between sampling algorithms and hardware. Without considering hardware, algorithm designers optimize sampling only at the algorithm level, missing the great potential to improve the efficiency of existing sampling algorithms by leveraging hardware features. In this paper, we first propose a unified programming model for mainstream sampling algorithms, termed GNNSampler, covering the critical processes of sampling algorithms in various categories. Second, to leverage hardware features, we choose data locality as a case study and explore the data locality among nodes and their neighbors in a graph to alleviate irregular memory access in sampling. Third, we implement locality-aware optimizations in GNNSampler for various sampling algorithms to optimize the general sampling process. Finally, we conduct experiments on large graph datasets to analyze the relationship among training time, accuracy, and hardware-level metrics. Extensive experiments show that our method is universal to mainstream sampling algorithms and helps significantly reduce the training time, especially on large-scale graphs.
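One way to picture a locality-aware sampling choice is to compare uniform neighbor sampling with sampling a contiguous window of the sorted neighbor list. This is a hypothetical stand-in, not GNNSampler's programming model or its actual optimization. It assumes that vertices with nearby IDs are stored near each other in memory, and all names below are illustrative.

```python
import random

def uniform_sample(neighbors, k, rng):
    """Baseline: pick k neighbors uniformly at random (scattered memory accesses)."""
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)

def locality_aware_sample(neighbors, k, rng):
    """Pick k neighbors from one contiguous window of the sorted neighbor list.

    Assumes nearby vertex IDs map to nearby memory, so the window touches
    fewer distinct memory regions than a uniform sample.
    """
    ordered = sorted(neighbors)
    if len(ordered) <= k:
        return ordered
    start = rng.randrange(len(ordered) - k + 1)
    return ordered[start:start + k]

def id_span(sample):
    """Distance between the largest and smallest sampled ID, a crude locality proxy."""
    return max(sample) - min(sample) if sample else 0
```

The window start is still random, so different neighbors are seen across calls, but restricting each sample to a window changes the sampling distribution. Any effect on accuracy would have to be measured rather than assumed.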

preprint, 2020, arXiv

Deep Learning Detection of Inaccurate Smart Electricity Meters: A Case Study

Detecting inaccurate smart meters and targeting them for replacement can save significant resources. For this purpose, a novel deep-learning method was developed based on long short-term memory (LSTM) and a modified convolutional neural network (CNN) to predict electricity usage trajectories based on historical data. From the significant difference between the predicted trajectory and the observed one, the meters that cannot measure electricity accurately are located. In a case study, a proof of principle was demonstrated in detecting inaccurate meters with high accuracy for practical usage to prevent unnecessary replacement and increase the service life span of smart meters.
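The detection step described above, comparing a predicted usage trajectory with the observed one, can be sketched independently of the predictor. The example below takes predictions as given (the LSTM/CNN model from the abstract is out of scope here). The mean relative deviation metric, the `threshold` value, and the function names are assumptions, not the paper's criteria.

```python
def mean_relative_deviation(predicted, observed):
    """Average of |observed - predicted| / |predicted|, skipping zero predictions."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != 0]
    if not pairs:
        return 0.0
    return sum(abs(o - p) / abs(p) for p, o in pairs) / len(pairs)

def flag_inaccurate_meters(predictions, observations, threshold=0.1):
    """Return the IDs of meters whose observed usage deviates beyond the threshold.

    `predictions` and `observations` map meter IDs to usage trajectories.
    The 10% default threshold is a placeholder, not a value from the source.
    """
    flagged = []
    for meter_id, predicted in predictions.items():
        deviation = mean_relative_deviation(predicted, observations[meter_id])
        if deviation > threshold:
            flagged.append(meter_id)
    return sorted(flagged)
```

For example, a meter that consistently reads 20% below its predicted trajectory is flagged at the default threshold, while one that tracks the prediction within about 1% is not.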

preprint, 2020, arXiv

ENAS4D: Efficient Multi-stage CNN Architecture Search for Dynamic Inference

Dynamic inference is a feasible way to reduce the computational cost of convolutional neural networks (CNNs), as it dynamically adjusts the computation for each input sample. One way to achieve dynamic inference is to use a multi-stage neural network, which contains a sub-network with a prediction layer at each stage. The inference of an input sample can exit at an early stage if the prediction of that stage is confident enough. However, designing a multi-stage CNN architecture is a non-trivial task. In this paper, we introduce a general framework, ENAS4D, which can efficiently search for the optimal multi-stage CNN architecture for dynamic inference in a well-designed search space. First, we propose a method to construct the search space with multi-stage convolution. The search space includes different numbers of layers, different kernel sizes, and different numbers of channels for each stage, as well as the resolution of input samples. Then, we train a once-for-all network that supports sampling diverse multi-stage CNN architectures. A specialized multi-stage network can be obtained from the once-for-all network without additional training. Finally, we devise a method to efficiently search for the optimal multi-stage network that trades off accuracy against computational cost by taking advantage of the once-for-all network. The experiments on the ImageNet classification task demonstrate that the multi-stage CNNs searched by ENAS4D consistently outperform the state-of-the-art method for dynamic inference. In particular, the network achieves 74.4% ImageNet top-1 accuracy under 185M average MACs.
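The early-exit behavior of a multi-stage network can be sketched as a simple control loop. This shows only the inference-time exit rule, not the ENAS4D search itself. The confidence measure (maximum class probability), the `threshold` default, and the toy `make_stage` helper are assumptions for illustration.

```python
def early_exit_predict(stages, x, threshold=0.9):
    """Run stages in order and stop at the first sufficiently confident prediction.

    Each stage is a callable returning (features, class_probabilities).
    Returns (predicted_class, index_of_exit_stage). The last stage always exits.
    """
    features = x
    for index, stage in enumerate(stages):
        features, probs = stage(features)
        confidence = max(probs)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), index
    raise ValueError("at least one stage is required")

def make_stage(probs):
    """Toy stage that passes features through and emits fixed probabilities."""
    def stage(features):
        return features, probs
    return stage
```

With toy stages emitting `[0.6, 0.4]`, `[0.05, 0.95]`, and `[0.5, 0.5]`, the default threshold exits at the second stage, so the third stage's computation is never run. Lowering the threshold trades accuracy for fewer executed stages.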

preprint, 2020, arXiv

Quantum phases of SrCu2(BO3)2 from high-pressure thermodynamics

We report heat capacity measurements of SrCu$_2$(BO$_3$)$_2$ under high pressure along with simulations of relevant quantum spin models and map out the $(P,T)$ phase diagram of the material. We find a first-order quantum phase transition between the low-pressure quantum dimer paramagnet and a phase with signatures of a plaquette-singlet state below $T = 2$ K. At higher pressures, we observe a transition into a previously unknown antiferromagnetic state below $4$ K. Our findings can be explained within the two-dimensional Shastry-Sutherland quantum spin model supplemented by weak inter-layer couplings. The possibility to tune SrCu$_2$(BO$_3$)$_2$ between the plaquette-singlet and antiferromagnetic states opens opportunities for experimental tests of quantum field theories and lattice models involving fractionalized excitations, emergent symmetries, and gauge fluctuations.

preprint, 2016, arXiv

Perspectives of Racetrack Memory for Large-Capacity On-Chip Memory: From Device to System

Current-induced domain wall motion (CIDWM) is regarded as a promising way towards achieving emerging high-density, high-speed and low-power non-volatile devices. Racetrack memory is an attractive spintronic memory based on this phenomenon, which can store and transfer a series of data along a magnetic nanowire. However, limited storage capacity remains one of the most serious bottlenecks hindering its application in practical systems. This paper focuses on the potential of racetrack memory towards large capacity. Investigations ranging from the device level to the system level have been carried out. Various alternative mechanisms to improve the capacity of racetrack memory have been proposed and elucidated, e.g. magnetic field assistance, chiral DW motion and voltage-controlled flexible DW pinning. All of them can increase nanowire length, making large-capacity racetrack memory more feasible. Using a SPICE-compatible racetrack memory electrical model and a commercial CMOS 28 nm design kit, mixed simulations are performed to validate their functionalities and analyze their performance. System-level evaluations demonstrate the impact of capacity improvement on the overall system. Compared with a traditional SRAM-based cache, a racetrack-memory-based cache shows advantages in terms of execution time and energy consumption.

preprint, 2015, arXiv

NXgraph: An Efficient Graph Processing System on a Single Machine

Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based graph processing systems. In this paper, we present NXgraph, an efficient graph processing system on a single machine. With the abstraction of vertex intervals and edge sub-shards, we propose the Destination-Sorted Sub-Shard (DSSS) structure to store a graph. By dividing vertices and edges into intervals and sub-shards, NXgraph ensures graph data access locality and enables fine-grained scheduling. By sorting edges within each sub-shard according to their destination vertices, NXgraph reduces write conflicts among different threads and achieves a high degree of parallelism. Then, three updating strategies, i.e., Single-Phase Update (SPU), Double-Phase Update (DPU), and Mixed-Phase Update (MPU), are proposed in this paper. NXgraph can adaptively choose the fastest strategy for different graph problems according to the graph size and the available memory resources to fully utilize the memory space and reduce the amount of data transfer. All three strategies exploit a streamlined disk access pattern. Extensive experiments on three real-world graphs and five synthetic graphs show that NXgraph can outperform GraphChi, TurboGraph, VENUS, and GridGraph in various situations. Moreover, NXgraph, running on a single commodity PC, can finish an iteration of PageRank on the Twitter graph with 1.5 billion edges in 2.05 seconds, whereas PowerGraph, a distributed graph processing system, needs 3.6 seconds to finish the same task.
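The interval and sub-shard layout described above can be approximated in a few lines. This is an in-memory sketch of the general idea only: it does not implement NXgraph's on-disk format or its SPU/DPU/MPU strategies. Equal-width vertex intervals, the function names, and the PageRank details (damping factor 0.85, no special handling of vertices without out-edges) are assumptions.

```python
def build_sub_shards(edges, num_vertices, num_intervals):
    """Group edges by (source interval, destination interval) and sort each
    sub-shard by destination vertex, in the spirit of a destination-sorted layout."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    shards = {}
    for src, dst in edges:
        shards.setdefault((src // size, dst // size), []).append((src, dst))
    for shard in shards.values():
        shard.sort(key=lambda edge: (edge[1], edge[0]))
    return shards

def pagerank_iteration(shards, ranks, out_degree, damping=0.85):
    """One PageRank step that streams sub-shards grouped by destination interval."""
    n = len(ranks)
    incoming = [0.0] * n
    # Visiting sub-shards by destination interval keeps writes within one interval
    # at a time; sorting by destination makes those writes sequential.
    for key in sorted(shards, key=lambda k: (k[1], k[0])):
        for src, dst in shards[key]:
            incoming[dst] += ranks[src] / out_degree[src]
    return [(1 - damping) / n + damping * total for total in incoming]
```

Because all edges that write to a given destination interval live in sub-shards sharing that interval index, separate threads could each own one destination interval without contending on the same output entries. The sketch stays sequential for simplicity.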
