Source author record

Yaoyu Tao

Yaoyu Tao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Hardware Architecture, Information Theory, math.IT, Artificial Intelligence, eess.SP, Machine Learning

Catalog footprint

What is connected

4 works

6 topics

4 close collaborators

Preprint, 2022, arXiv

An Automated FPGA-based Framework for Rapid Prototyping of Nonbinary LDPC Codes

Nonbinary LDPC codes have shown superior performance close to the Shannon limit. Compared to binary LDPC codes of similar lengths, they can reach error rates that are orders of magnitude lower. However, the multitude of design freedoms in nonbinary LDPC codes complicates practical code and decoder design. Fast simulations are critically important for evaluating the trade-offs. Rapid prototyping on FPGA is attractive but takes significant design effort because of its high design complexity. We propose a high-throughput reconfigurable hardware emulation architecture with decoder and peripheral co-design. The architecture enables a library- and script-based framework that automates the construction of FPGA emulations. Code and decoder design parameters are programmed either at run time or by script at design time. We demonstrate the capability of the framework in evaluating practical code and decoder designs by experimenting with two popular classes of nonbinary LDPC codes, regular (2, d_c) codes and quasi-cyclic codes: each emulation model can be auto-constructed within hours, and the decoder delivers excellent error-correcting performance on a Xilinx Virtex-5 FPGA with throughput of up to hundreds of Mbps.

Preprint, 2022, arXiv

Efficient Post-Processors for Improving Error-Correcting Performance of LDPC Codes

The error floor phenomenon, associated with iterative decoders, is one of the most significant limitations on the application of low-density parity-check (LDPC) codes. A variety of techniques, from code design to decoder implementation, have been proposed to address the error floor problem, among which post-processors have been shown to be both effective and implementation-friendly. In this work, we take inspiration from simulated annealing to generalize the post-processor design using three methods: quenching, extended heating, and focused heating, each of which targets a different error structure. The resulting post-processor is demonstrated to lower the error floors by two orders of magnitude for two structured code examples: a (2209, 1978) array LDPC code and a (1944, 1620) LDPC code used by the IEEE 802.11n standard. The post-processor can be integrated into a belief-propagation decoder with minimal overhead. The post-processor design is equally applicable to other structured LDPC codes.
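As a quick check on the two example codes named in this abstract, the sketch below computes the code rate k/n and the nominal number of parity symbols n - k for an (n, k) block code. The helper names `code_rate` and `parity_symbols` are hypothetical and do not come from the paper; only the (n, k) pairs are taken from the abstract.

```python
from fractions import Fraction

# (n, k) pairs as quoted in the abstract; the dictionary keys are illustrative labels.
EXAMPLE_CODES = {
    "array LDPC": (2209, 1978),
    "IEEE 802.11n LDPC": (1944, 1620),
}


def code_rate(n: int, k: int) -> Fraction:
    """Return the exact code rate k/n of an (n, k) block code."""
    if not 0 < k <= n:
        raise ValueError("expected 0 < k <= n")
    return Fraction(k, n)


def parity_symbols(n: int, k: int) -> int:
    """Return n - k, the nominal number of parity symbols."""
    return n - k


if __name__ == "__main__":
    for name, (n, k) in EXAMPLE_CODES.items():
        rate = code_rate(n, k)
        print(f"{name}: rate {rate} (about {float(rate):.4f}), "
              f"{parity_symbols(n, k)} parity symbols")
```

The (1944, 1620) code works out to a rate of exactly 5/6, and the (2209, 1978) array code to a rate of roughly 0.895, so both examples are high-rate codes.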

Preprint, 2022, arXiv

High-Throughput Split-Tree Architecture for Nonbinary SCL Polar Decoder

Nonbinary polar codes defined over the Galois field GF(q) have shown better error-correction performance than binary polar codes using successive-cancellation list (SCL) decoding. However, nonbinary operations are complex, and a direct-mapped decoder results in low throughput, which hinders practical adoption. In this work, we develop, to the best of our knowledge, the first hardware implementation of nonbinary SCL polar decoding. We present a high-throughput decoder architecture using a split-tree algorithm. The sub-trees are decoded in parallel by smaller sub-decoders, with a reconciliation stage to maintain constraints between sub-trees. A skimming algorithm is proposed to reduce the reconciliation complexity and further improve throughput. The split-tree nonbinary SCL (S-NBSCL) polar decoder is prototyped in a 28 nm CMOS technology for a (128,64) polar code over GF(256). The decoder delivers 26.1 Mb/s throughput, 11.65 Mb/s/mm$^2$ area efficiency and 28.8 nJ/b energy efficiency, outperforming the direct-mapped decoder by 10.3x, 4.4x and 2.7x, respectively, while achieving excellent error-correction performance.
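The reported figures and improvement factors in this abstract imply approximate values for the direct-mapped baseline. The sketch below back-computes them. It assumes that "2.7x" for energy efficiency means 2.7 times lower energy per bit; the derived baseline numbers are rounded estimates and are not stated in the abstract. The function and key names are hypothetical.

```python
# Figures for the S-NBSCL decoder, as quoted in the abstract.
REPORTED = {
    "throughput_mbps": 26.1,
    "area_eff_mbps_per_mm2": 11.65,
    "energy_nj_per_bit": 28.8,
}

# Improvement factors over the direct-mapped decoder, as quoted in the abstract.
GAINS = {
    "throughput_mbps": 10.3,
    "area_eff_mbps_per_mm2": 4.4,
    "energy_nj_per_bit": 2.7,
}

# Assumption: energy per bit improves by getting smaller, the others by getting larger.
HIGHER_IS_BETTER = {
    "throughput_mbps": True,
    "area_eff_mbps_per_mm2": True,
    "energy_nj_per_bit": False,
}


def implied_baseline(reported, gains, higher_is_better):
    """Back-compute baseline metrics from reported values and improvement factors."""
    baseline = {}
    for key, value in reported.items():
        if higher_is_better[key]:
            baseline[key] = value / gains[key]  # baseline was lower
        else:
            baseline[key] = value * gains[key]  # baseline was higher
    return baseline


if __name__ == "__main__":
    for key, value in implied_baseline(REPORTED, GAINS, HIGHER_IS_BETTER).items():
        print(f"{key}: about {value:.2f}")
```

Under these assumptions the direct-mapped decoder would sit near 2.5 Mb/s, 2.6 Mb/s/mm$^2$ and 78 nJ/b.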

Preprint, 2022, arXiv

HiMA: A Fast and Scalable History-based Memory Access Engine for Differentiable Neural Computer

Memory-augmented neural networks (MANNs) provide better inference performance in many tasks with the help of an external memory. The recently developed differentiable neural computer (DNC) is a MANN that has been shown to excel at representing complicated data structures and learning long-term dependencies. DNC's higher performance is derived from new history-based attention mechanisms in addition to the previously used content-based attention mechanisms. History-based mechanisms require a variety of new compute primitives and state memories, which are not supported by existing neural network (NN) or MANN accelerators. We present HiMA, a tiled, history-based memory access engine with distributed memories in tiles. HiMA incorporates a multi-mode network-on-chip (NoC) to reduce communication latency and improve scalability. An optimal submatrix-wise memory partition strategy is applied to reduce the amount of NoC traffic, and a two-stage usage sort method leverages distributed tiles to improve computation speed. To make HiMA fundamentally scalable, we create a distributed version of DNC called DNC-D, which allows almost all memory operations to be applied to local memories, with a trainable weighted summation producing the global memory output. Two approximation techniques, usage skimming and softmax approximation, are proposed to further enhance hardware efficiency. HiMA prototypes are created in RTL and synthesized in a 40 nm technology. In simulation, HiMA running DNC and DNC-D demonstrates 6.47x and 39.1x higher speed, 22.8x and 164.3x better area efficiency, and 6.1x and 61.2x better energy efficiency than the state-of-the-art MANN accelerator. Compared to an Nvidia 3080Ti GPU, HiMA demonstrates a speedup of up to 437x and 2,646x when running DNC and DNC-D, respectively.
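The DNC-D idea of combining per-tile results through a weighted summation can be pictured with a minimal sketch. This is an illustration only, not the paper's implementation: the function name `combine_tile_reads`, the list-based data layout, and the example weights are all assumptions, and in DNC-D the weights would be learned during training rather than fixed by hand.

```python
from typing import List, Sequence


def combine_tile_reads(local_reads: Sequence[Sequence[float]],
                       tile_weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-tile read vectors into one global read vector.

    local_reads[i] is the read vector produced by tile i from its local memory.
    tile_weights[i] is the (hypothetical) trained weight for tile i.
    """
    if not local_reads:
        raise ValueError("need at least one tile")
    if len(local_reads) != len(tile_weights):
        raise ValueError("one weight per tile is required")

    width = len(local_reads[0])
    global_read = [0.0] * width
    for read, weight in zip(local_reads, tile_weights):
        if len(read) != width:
            raise ValueError("all read vectors must have the same width")
        for j, value in enumerate(read):
            global_read[j] += weight * value
    return global_read


if __name__ == "__main__":
    # Two tiles, each returning a 2-element read vector from local memory.
    reads = [[1.0, 2.0], [3.0, 4.0]]
    weights = [0.5, 0.5]  # illustrative values, not trained
    print(combine_tile_reads(reads, weights))
```

In this picture each tile works only on its own memory, and the only cross-tile step is the final summation, which is consistent with the abstract's point about keeping almost all memory operations local.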
