Researcher profile

Xueqing Li

Xueqing Li contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
6 works
0 followers
2 topics
4 close collaborators

Published work

6 published items

Preprint, 2022, arXiv

FAST: A Fully-Concurrent Access Technique to All SRAM Rows for Enhanced Speed and Energy Efficiency in Data-Intensive Applications

Compute-in-memory (CiM) is a promising approach to improving the computing speed and energy efficiency in data-intensive applications. Beyond existing CiM techniques of bitwise logic-in-memory operations and dot product operations, this paper extends the CiM paradigm with FAST, a new shift-based in-memory computation technique to handle high-concurrency operations on multiple rows in an SRAM. Such high-concurrency operations are widely seen in both conventional applications (e.g., the table update in a database) and emerging applications (e.g., the parallel weight update in neural network accelerators), in which low latency and low energy consumption are critical. The proposed shift-based CiM architecture is enabled by integrating the shifter function into each SRAM cell, and by creating a datapath that exploits the high parallelism of shifting operations in multiple rows in the array. A 128-row 16-column shiftable SRAM in 65nm CMOS is designed to evaluate the proposed architecture. Post-layout SPICE simulations show average improvements of 4.4x energy efficiency and 96.0x speed over a conventional fully-digital memory-computing-separated scheme, when performing the 8-bit weight update task in a VGG-7 framework.
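
The concurrency argument in this abstract can be illustrated with a small Python sketch. This is a toy model and not the paper's circuit: it assumes each row behaves like a shift register that feeds a hypothetical one-bit adder, so an 8-bit update takes 8 shift steps no matter how many rows are updated, while a conventional scheme accesses rows one at a time. The function names and the step-count cost model are illustrative assumptions and do not reproduce the reported 4.4x or 96.0x figures.

```python
# Toy model only: rows as shift registers with an assumed per-row 1-bit adder.
# Step counts are an illustrative cost model, not figures from the paper.

WIDTH = 8  # bits per weight, matching the 8-bit weight update task


def shift_update_all_rows(rows, deltas, width=WIDTH):
    """Add deltas[i] to rows[i] (mod 2**width) bit-serially, all rows in lockstep.

    Returns (updated_rows, steps); steps counts shift steps shared by every row.
    """
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]
    deltas = [d & mask for d in deltas]
    carries = [0] * len(rows)
    steps = 0
    for _ in range(width):
        # One shift step: in hardware every row would shift simultaneously;
        # the inner loop only emulates that parallelism in software.
        for i in range(len(rows)):
            total = (rows[i] & 1) + (deltas[i] & 1) + carries[i]
            carries[i] = total >> 1
            # Shift right and feed the sum bit back in at the MSB position.
            rows[i] = (rows[i] >> 1) | ((total & 1) << (width - 1))
            deltas[i] >>= 1
        steps += 1
    return rows, steps


def row_by_row_update(rows, deltas, width=WIDTH):
    """Baseline: one read-modify-write access per row."""
    mask = (1 << width) - 1
    updated = [(r + d) & mask for r, d in zip(rows, deltas)]
    return updated, len(updated)


if __name__ == "__main__":
    weights = [3, 250, 17, 128]
    updates = [5, 10, -1, 128]
    print(shift_update_all_rows(weights, updates))  # ([8, 4, 16, 0], 8)
    print(row_by_row_update(weights, updates))      # ([8, 4, 16, 0], 4)
```

In this model the shift-based step count stays at the word width as rows are added, while the baseline grows with the row count; for a 128-row array that is 8 steps against 128.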

Preprint, 2022, arXiv

Ferroelectric FET-based strong physical unclonable function: a low-power, high-reliable and reconfigurable solution for Internet-of-Things security

Hardware security has been a key concern in modern information technologies. In particular, as the number of Internet-of-Things (IoT) devices grows rapidly, protecting device security with low-cost security primitives becomes essential; among these, the Physical Unclonable Function (PUF) is a widely used solution. In this paper, we propose the first FeFET-based strong PUF exploiting the cycle-to-cycle (C2C) variation of FeFETs as the entropy source. Based on experimental measurements, the proposed PUF shows satisfactory performance, including high uniformity, uniqueness, reconfigurability and reliability. To resist machine-learning attacks, an XOR structure is introduced, and simulations show that the proposed PUF has resistance to existing attack models similar to that of traditional arbiter PUFs. Furthermore, our design is shown to be power-efficient, and highly robust to write voltage, temperature and device size, which makes it a competitive security solution for Internet-of-Things edge devices.
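
The quality metrics named in this abstract can be made concrete with a short Python sketch. It assumes the commonly used Hamming-distance definitions (uniformity as the fraction of 1 bits, uniqueness as the mean pairwise inter-device distance, reliability as one minus the mean intra-device distance); the paper's exact formulas may differ, and the bit strings below are made-up illustration data, not measurements.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions where two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("responses must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def uniformity(response):
    """Fraction of 1 bits in one response; 0.5 is the ideal value."""
    return sum(response) / len(response)


def uniqueness(responses):
    """Mean pairwise inter-device Hamming fraction; 0.5 is the ideal value."""
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def reliability(reference, remeasurements):
    """One minus the mean intra-device Hamming fraction; 1.0 is the ideal value."""
    intra = sum(hamming_fraction(reference, r) for r in remeasurements)
    return 1.0 - intra / len(remeasurements)


if __name__ == "__main__":
    # Hypothetical responses of three devices to the same challenge.
    device_a = [1, 0, 1, 0, 1, 1, 0, 0]
    device_b = [0, 0, 1, 1, 1, 0, 0, 1]
    device_c = [1, 1, 0, 0, 0, 1, 1, 0]
    print("uniformity:", uniformity(device_a))
    print("uniqueness:", uniqueness([device_a, device_b, device_c]))
    # Hypothetical repeated readout of device_a with one flipped bit.
    print("reliability:", reliability(device_a, [device_a, [1, 0, 1, 0, 1, 1, 0, 1]]))
```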

Preprint, 2022, arXiv

GRAPHIC: GatheR-And-Process in Highly parallel with In-SSD Compression Architecture in Very Large-Scale Graph

Graph convolutional network (GCN), an emerging algorithm for graph computing, has achieved promising performance in graph-structure tasks. To accelerate data-intensive and sparse graph computing, ASICs such as GCNAX have been proposed for efficient execution of aggregation and combination in GCN. GCNAX reduces DRAM accesses by 8x compared with previous efforts. However, as graphs have reached terabytes in size, off-chip data movement from SSD to DRAM becomes a serious latency bottleneck. This paper proposes Compressive Graph Transmission (CGTrans), which performs the aggregation in SSD to dramatically relieve the transfer latency bottleneck due to SSD loading compared to CMOS-based graph accelerator ASICs. An in-SSD computing technique is required for CGTrans. Recently, Insider was proposed as a near-SSD processing system that computes by integrating an FPGA in the SSD. However, Insider still suffers from low area efficiency, which will limit the performance of CGTrans. The recently proposed Fully Concurrent Access Technique (FAST) is therefore utilized: FAST-GAS, an in-SSD graph computing accelerator, is proposed to provide highly concurrent gather-and-scatter operations to overcome the area efficiency problem. We propose the GRAPHIC system, containing the CGTrans dataflow deployed on FAST-GAS. Experiments show CGTrans reduces SSD loading by a factor of 50x, while GRAPHIC achieves 3.6x and 2.4x speedup on average over GCNAX and CGTrans on Insider, respectively.

Preprint, 2022, arXiv

YOLoC: DeploY Large-Scale Neural Network by ROM-based Computing-in-Memory using ResiduaL Branch on a Chip

Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing SRAM-based CiM needs to reload the weights from DRAM in large-scale networks. This reloading significantly weakens the energy efficiency. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tuning technique, namely Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM and assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection, is presented and evaluated. With the same area in 28nm CMOS, YOLoC shows energy efficiency improvements of 14.8x for YOLO (Darknet-19) and 4.8x for ResNet-18 across several datasets, with <8% latency overhead and almost no mean average precision (mAP) loss (-0.5% to +0.2%), compared with the fully SRAM-based CiM.

Preprint, 2021, arXiv

Dynamic Ternary Content-Addressable Memory Is Indeed Promising: Design and Benchmarking Using Nanoelectromechanical Relays

Ternary content addressable memory (TCAM) has been a critical component in caches, routers, etc., in which density, speed, power efficiency, and reliability are the major design targets. Existing options include the conventional low-write-power but bulky SRAM-based TCAM, and denser but less reliable or higher-write-power TCAM designs using nonvolatile memory (NVM) devices. Meanwhile, some TCAM designs using dynamic memories have also been proposed. Although dynamic TCAM is denser than CMOS SRAM TCAM and more reliable than NVM TCAM, the conventional row-by-row refresh operations become a bottleneck because they interfere with normal TCAM activities. Therefore, this paper proposes a custom low-power dynamic TCAM using nanoelectromechanical (NEM) relay devices, utilizing one-shot refresh to solve the memory refresh problem. By harnessing the unique NEM relay characteristics with a proposed novel cell structure, the proposed TCAM occupies a small footprint of only 3 transistors (with two NEM relays integrated on top through the back-end-of-line process), making it significantly denser than the 16-transistor SRAM-based TCAM. In addition, evaluations show that the proposed TCAM improves the write energy efficiency by 2.31x, 131x, and 13.5x over SRAM, RRAM, and FeFET TCAMs, respectively. The search energy-delay product is improved by 12.7x, 1.30x, and 2.83x over SRAM, RRAM, and FeFET TCAMs, respectively.
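
For readers unfamiliar with ternary matching, the following Python sketch shows what a TCAM search computes. It is a functional illustration only and says nothing about the NEM relay cell design: each stored entry is a string over '0', '1' and 'X' (don't care), and the function name and sample entries are hypothetical. A hardware TCAM compares all rows in parallel; the loop here merely emulates that.

```python
def tcam_search(entries, key):
    """Return the indices of all stored entries that match the search key.

    entries: list of strings over '0', '1', 'X', where 'X' matches either bit.
    key: string over '0' and '1' with the same length as each entry.
    """
    matches = []
    for index, pattern in enumerate(entries):
        if len(pattern) != len(key):
            raise ValueError("entry and key lengths must agree")
        # A row matches when every non-wildcard cell equals the key bit.
        if all(p == "X" or p == k for p, k in zip(pattern, key)):
            matches.append(index)
    return matches


if __name__ == "__main__":
    table = ["10X1", "1XX0", "0000"]
    print(tcam_search(table, "1011"))  # [0]
    print(tcam_search(table, "1010"))  # [1]
```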

Preprint, 2021, arXiv

Enabling Lower-Power Charge-Domain Nonvolatile In-Memory Computing with Ferroelectric FETs

Compute-in-memory (CiM) is a promising approach to alleviating the memory wall problem for domain-specific applications. Compared to current-domain CiM solutions, charge-domain CiM shows the opportunity for higher energy efficiency and resistance to device variations. However, the area occupation and standby leakage power of existing SRAM-based charge-domain CiM (CD-CiM) are high. This paper proposes the first concept and analysis of CD-CiM using nonvolatile memory (NVM) devices. The design implementation and performance evaluation are based on a proposed 2-transistor-1-capacitor (2T1C) CiM macro using ferroelectric field-effect transistors (FeFETs), which is free from leakage power and much denser than the SRAM solution. With the supply voltage between 0.45V and 0.90V and the operating frequency between 100MHz and 1.0GHz, binary neural network application simulations show over 47%, 60%, and 64% energy consumption reduction from existing SRAM-based CD-CiM, SRAM-based current-domain CiM, and RRAM-based current-domain CiM, respectively. For classifications in the MNIST and CIFAR-10 data sets, the proposed FeFET-based CD-CiM achieves accuracies of over 95% and 80%, respectively.
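
The binary neural network workload mentioned in this abstract reduces to a simple arithmetic identity, sketched below in Python. This is the standard XNOR-popcount formulation and is an assumption about the workload, not a description of the 2T1C macro: with +1 encoded as bit 1 and -1 as bit 0, the dot product of two n-element vectors equals 2 * (number of matching bits) - n. Roughly speaking, a charge-domain macro would accumulate the matches as charge on capacitors rather than in a digital counter; the sketch captures only the arithmetic, and the function names are hypothetical.

```python
def binary_dot(weight_bits, activation_bits):
    """Dot product of two +/-1 vectors encoded as bits (1 -> +1, 0 -> -1)."""
    if len(weight_bits) != len(activation_bits):
        raise ValueError("vectors must have equal length")
    n = len(weight_bits)
    # XNOR is 1 where the bits agree; summing it gives the popcount.
    matches = sum(1 for w, a in zip(weight_bits, activation_bits) if w == a)
    return 2 * matches - n


def signed_dot(weight_bits, activation_bits):
    """Reference implementation using explicit +/-1 values."""
    to_sign = lambda bit: 1 if bit else -1
    return sum(to_sign(w) * to_sign(a) for w, a in zip(weight_bits, activation_bits))


if __name__ == "__main__":
    weights = [1, 0, 1, 1]
    activations = [1, 1, 0, 1]
    print(binary_dot(weights, activations))  # 0
    print(signed_dot(weights, activations))  # 0
```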