Source author record

Zhongfeng Wang

Zhongfeng Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision eess.SP Hardware Architecture Information Theory math.IT

Catalog footprint

What is connected

3works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer

Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6x in the MHA ResBlock, and 3.4x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.

preprint2016arXiv

Intra-layer Nonuniform Quantization for Deep Convolutional Neural Network

Deep convolutional neural network (DCNN) has achieved remarkable performance on object detection and speech recognition in recent years. However, the excellent performance of a DCNN incurs high computational complexity and large memory requirement. In this paper, an equal distance nonuniform quantization (ENQ) scheme and a K-means clustering nonuniform quantization (KNQ) scheme are proposed to reduce the required memory storage when low complexity hardware or software implementations are considered. For the VGG-16 and the AlexNet, the proposed nonuniform quantization schemes reduce the number of required memory storage by approximately 50\% while achieving almost the same or even better classification accuracy compared to the state-of-the-art quantization method. Compared to the ENQ scheme, the proposed KNQ scheme provides a better tradeoff when higher accuracy is required.

preprint2012arXiv

Reduced-Complexity Column-Layered Decoding and Implementation for LDPC Codes

Layered decoding is well appreciated in Low-Density Parity-Check (LDPC) decoder implementation since it can achieve effectively high decoding throughput with low computation complexity. This work, for the first time, addresses low complexity column-layered decoding schemes and VLSI architectures for multi-Gb/s applications. At first, the Min-Sum algorithm is incorporated into the column-layered decoding. Then algorithmic transformations and judicious approximations are explored to minimize the overall computation complexity. Compared to the original column-layered decoding, the new approach can reduce the computation complexity in check node processing for high-rate LDPC codes by up to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed pipelining scheme is presented to enable very high clock speed for VLSI implementation. Equipped with these new techniques, an efficient decoder architecture for quasi-cyclic LDPC codes is developed and implemented with 0.13um CMOS technology. It is shown that a decoding throughput of nearly 4 Gb/s at maximum of 10 iterations can be achieved for a (4096, 3584) LDPC code. Hence, this work has facilitated practical applications of column-layered decoding and particularly made it very attractive in high-speed, high-rate LDPC decoder implementation.