Source author record

Warren J. Gross

Warren J. Gross appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Information Theory, math.IT, Hardware Architecture, Machine Learning, Computation and Language, Neural and Evolutionary Computing, Artificial Intelligence, eess.SP, math.OC, math.PR, Performance

Catalog footprint

What is connected

31 works

11 topics

4 close collaborators


preprint 2022 arXiv

Efficient Fine-Tuning of BERT Models on the Edge

Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, come increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.

preprint 2022 arXiv

Efficient Fine-Tuning of Compressed Language Models with Learners

Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and have significantly lower resource utilization.

preprint 2022 arXiv

Fast Successive-Cancellation List Flip Decoding of Polar Codes

This work presents a fast successive-cancellation list flip (Fast-SCLF) decoding algorithm for polar codes that addresses the high latency issue associated with the successive-cancellation list flip (SCLF) decoding algorithm. We first propose a bit-flipping strategy tailored to the state-of-the-art fast successive-cancellation list (FSCL) decoding that avoids tree-traversal in the binary tree representation of SCLF, thus reducing the latency of the decoding process. We then derive a parameterized path selection error model to accurately estimate the bit index at which the correct decoding path is eliminated from the initial FSCL decoding. The trainable parameter is optimized online based on an efficient supervised learning framework. Simulation results show that for a polar code of length 512 with 256 information bits, with similar error-correction performance and memory consumption, the proposed Fast-SCLF decoder reduces the average decoding latency of the SCLF decoder with the same list size by up to $73.4\%$ at the frame error rate of $10^{-4}$, while incurring a maximum computational complexity overhead of $27.6\%$. For the same polar code of length 512 with 256 information bits and at practical signal-to-noise ratios, the proposed decoder with list size 4 reduces the average complexity and decoding latency of the FSCL decoder with list size 32 (FSCL-32) by $89.3\%$ and $43.7\%$, respectively, while also reducing the memory consumption of FSCL-32 by $83.2\%$. The significant improvements of the proposed decoder come at the cost of a $0.07$ dB degradation in error-correction performance compared with FSCL-32.

preprint 2022 arXiv

High-Throughput and Energy-Efficient VLSI Architecture for Ordered Reliability Bits GRAND

Ultra-reliable low-latency communication (URLLC), a major 5G New-Radio use case, is the key enabler for applications with strict reliability and latency requirements. These applications necessitate the use of short-length and high-rate codes. Guessing Random Additive Noise Decoding (GRAND) is a recently proposed Maximum Likelihood (ML) decoding technique for these short-length and high-rate codes. Rather than decoding the received vector, GRAND tries to infer the noise that corrupted the transmitted codeword during transmission through the communication channel. As a result, GRAND can decode any code, structured or unstructured. GRAND has hard-input as well as soft-input variants. Among these variants, Ordered Reliability Bits GRAND (ORBGRAND) is a soft-input variant that outperforms hard-input GRAND and is suitable for parallel hardware implementation. This work reports the first hardware architecture for ORBGRAND, which achieves an average throughput of up to $42.5$ Gbps for a code length of $128$ at a target FER of $10^{-7}$. Furthermore, the proposed hardware can be used to decode any code as long as the length and rate constraints are met. Compared to GRANDAB, a hard-input variant of GRAND, the proposed architecture enhances decoding performance by at least $2$ dB. When compared to the state-of-the-art fast dynamic successive cancellation flip decoder (Fast-DSCF) using a 5G polar $(128,105)$ code, the proposed ORBGRAND VLSI implementation has $49\times$ higher average throughput, $32\times$ more energy efficiency, and $5\times$ more area efficiency while maintaining similar decoding performance.

preprint 2022 arXiv

Standard Deviation-Based Quantization for Deep Neural Networks

Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.
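The shift-add idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's learned, standard-deviation-based scheme: it simply rounds each weight to the nearest power of two in the log domain and then multiplies an integer activation using a bit shift. The function names `quantize_pow2` and `shift_multiply`, and the exponent range, are hypothetical choices made for this illustration.

```python
import math
from typing import Tuple


def quantize_pow2(w: float, min_exp: int = -8, max_exp: int = 0) -> Tuple[int, int]:
    """Approximate w by sign * 2**exp and return (sign, exp).

    Assumption: nearest power of two in the log2 domain, clipped to
    [min_exp, max_exp]. A zero weight gets sign 0 (it is effectively pruned).
    """
    if w == 0.0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x: int, sign: int, exp: int) -> int:
    """Multiply a non-negative integer activation by sign * 2**exp with shifts only."""
    # A negative exponent becomes a right shift, which discards low-order bits.
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


sign, exp = quantize_pow2(0.3)          # 0.3 is approximated by 2**-2 = 0.25
product = shift_multiply(12, sign, exp)  # 12 * 0.25 computed as 12 >> 2
```

Because the quantized weight is stored as a sign and a small exponent, the multiplier in a hardware datapath can be replaced by a shifter, which is the resource saving the abstract refers to.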

preprint 2022 arXiv

Successive-Cancellation Decoding of Reed-Muller Codes with Fast Hadamard Transform

A novel permuted fast successive-cancellation list decoding algorithm with fast Hadamard transform (FHT-FSCL) is presented. The proposed decoder initializes $L$ $(L\ge1)$ active decoding paths with $L$ random codeword permutations sampled from the full symmetry group of the codes. The path extension in the permutation domain is carried out until the first constituent RM code of order $1$ is visited. Conventional path extension of the successive-cancellation list decoder is then utilized in the information bit domain. The simulation results show that for an RM code of length $512$ with $46$ information bits, by running $20$ parallel permuted FHT-FSCL decoders with $L=4$, we reduce the computational complexity by $72\%$, the decoding latency by $22\%$, and the memory consumption by $84\%$ relative to the state-of-the-art simplified successive-cancellation decoder that uses $512$ permutations sampled from the full symmetry group of the code, with similar error-correction performance at the target frame error rate of $10^{-4}$.

preprint 2020 arXiv

Fast Thresholded SC-Flip Decoding of Polar Codes

The SC-Flip (SCF) decoding algorithm has attracted attention alongside the common polar code decoding approaches due to its low complexity and improved error-correction performance. However, the inefficient criterion for locating the correct bit-flipping position in SCF decoding limits its improvements. Due to its improved bit-flipping criterion, the Thresholded SCF (TSCF) decoding algorithm exhibits superior error-correction performance and lower computational complexity than SCF decoding. However, the parameters of TSCF decoding depend on multiple channel and code parameters, and are obtained via Monte-Carlo simulations. Our main goal is to realize TSCF decoding as a practical polar decoder implementation. To this end, we first realize an approximated threshold value that is independent of the code parameters and precomputations. The proposed approximation causes negligible error-correction performance degradation in TSCF decoding. Then, we validate an alternative approach for forming a critical set that does not require precomputations, which also paves the way to the implementation of the Fast-TSCF decoder. Compared to the existing fast SCF implementations, the proposed Fast-TSCF decoder has a $0.24$ to $0.41$ dB performance gain at a frame error rate of $10^{-3}$, without any extra cost. Compared to TSCF decoding, Fast-TSCF does not depend on precomputations and requires $87\%$ fewer decoding steps. Finally, implementation results in TSMC 65nm CMOS technology show that the Fast-TSCF decoder is $20\%$ and $82\%$ more area-efficient than the state-of-the-art fast SCF and fast SC-List decoder architectures, respectively.

preprint 2020 arXiv

High-Throughput VLSI Architecture for GRAND

Guessing Random Additive Noise Decoding (GRAND) is a recently proposed universal decoding algorithm for linear error correcting codes. Since GRAND does not depend on the structure of the code, it can be used for any code encountered in contemporary communication standards or may even be used for random linear network coding. This property makes this new algorithm particularly appealing. Instead of trying to decode the received vector, GRAND attempts to identify the noise that corrupted the codeword. To that end, GRAND relies on the generation of test error patterns that are successively applied to the received vector. In this paper, we propose the first hardware architecture for the GRAND algorithm. Considering GRAND with ABandonment (GRANDAB) that limits the number of test patterns, the proposed architecture only needs $2+\sum_{i=2}^{n} \left\lfloor\frac{i}{2}\right\rfloor$ time steps to perform the $\sum_{i=1}^3 \binom{n}{i}$ queries required when $\text{AB}=3$. For a code length of $128$, our proposed hardware architecture demonstrates only a fraction ($1.2\%$) of the total number of performed queries as time steps. Synthesis results using TSMC 65nm CMOS technology show that average throughputs of $32$ Gbps to $64$ Gbps can be achieved at an SNR of $10$ dB for a code length of $128$ and code rates higher than $0.75$, transmitted over an AWGN channel. Comparisons with a decoder tailored for a $(79,64)$ BCH code show that the proposed architecture can achieve a slightly higher average throughput at high SNRs, while obtaining the same decoding performance.
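The noise-guessing principle and the two counts quoted in the abstract can be checked with a short Python sketch. The decoder below is a plain software illustration of hard-input GRAND with abandonment, not the proposed hardware architecture: it tries error patterns in order of increasing Hamming weight and stops at the first one that yields a codeword. The (7,4) Hamming parity-check matrix and the function names are assumptions chosen only to keep the example small.

```python
from itertools import combinations
from math import comb

# Parity-check matrix of the (7,4) Hamming code (illustrative test code only).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def is_codeword(word):
    """A word is a codeword when every parity check sums to zero modulo 2."""
    return all(sum(h * w for h, w in zip(row, word)) % 2 == 0 for row in H)


def grandab_decode(received, ab):
    """Guess noise patterns of weight 1..ab; return (codeword or None, queries)."""
    if is_codeword(received):
        return list(received), 0
    queries = 0
    for weight in range(1, ab + 1):
        for positions in combinations(range(len(received)), weight):
            candidate = list(received)
            for p in positions:
                candidate[p] ^= 1  # remove the guessed noise
            queries += 1
            if is_codeword(candidate):
                return candidate, queries
    return None, queries  # abandoned: no codeword within weight ab


def num_queries(n, ab):
    """Worst-case number of test patterns: sum of C(n, i) for i = 1..ab."""
    return sum(comb(n, i) for i in range(1, ab + 1))


def num_time_steps(n):
    """Time-step count quoted in the abstract for AB = 3."""
    return 2 + sum(i // 2 for i in range(2, n + 1))


decoded, used = grandab_decode([0, 0, 0, 0, 1, 0, 0], ab=3)
ratio = num_time_steps(128) / num_queries(128, 3)
```

For $n = 128$ the formulas give $4098$ time steps against $349632$ queries, a ratio of about $1.2\%$, which agrees with the figure stated in the abstract.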

preprint 2020 arXiv

Operation Merging for Hardware Implementations of Fast Polar Decoders

Polar codes are a class of linear block codes that provably achieves channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario for $5^{\text{th}}$ generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works targeting improved throughput for successive-cancellation (SC) decoding of polar codes are semi-parallel implementations that exploit special maximum-likelihood (ML) nodes. In this work, we present a new fast simplified SC (Fast-SSC) decoder architecture. Compared to a baseline Fast-SSC decoder, our solution is able to reduce the memory requirements. We achieve this through more efficient memory utilization, which also makes it possible to execute multiple operations in a single clock cycle. Finally, we propose new special node merging techniques that improve the throughput further, and detail a new Fast-SSC-based decoder architecture to support merged operations. The proposed decoder reduces the operation sequence requirement by up to $39\%$, which reduces the number of time steps needed to decode a codeword by $35\%$. ASIC implementation results with 65 nm TSMC technology show that the proposed decoder has a throughput improvement of up to $31\%$ compared to previous Fast-SSC decoder architectures.

preprint 2020 arXiv

Practical Dynamic SC-Flip Polar Decoders: Algorithm and Implementation

SC-Flip (SCF) is a low-complexity polar code decoding algorithm with improved performance, and is an alternative to high-complexity cyclic redundancy check (CRC)-aided SC-List (CA-SCL) decoding. However, the performance improvement of SCF is limited since it can correct up to only one channel error ($\omega=1$). The Dynamic SCF (DSCF) algorithm addresses this problem by handling multiple errors ($\omega\geq 1$), but it requires logarithmic and exponential computations, which make it infeasible for practical applications. In this work, we propose simplifications and approximations to make DSCF practically feasible. First, we reduce the transcendental computations of DSCF decoding to a constant approximation. Then, we show how to incorporate special node decoding techniques into the DSCF algorithm, creating Fast-DSCF decoding. Next, we reduce the search span within the special nodes to further reduce the computational complexity. We then describe a hardware architecture for the Fast-DSCF decoder, in which we introduce additional simplifications such as metric normalization and sorter length reduction. All the simplifications and approximations are shown to have minimal impact on the error-correction performance, and the reported Fast-DSCF decoder is the only SCF-based architecture that can correct multiple errors. The Fast-DSCF decoders synthesized using TSMC $65$nm CMOS technology can achieve throughputs of $1.25$, $1.06$ and $0.93$ Gbps for $\omega\in \{1,2,3\}$, respectively. Compared to the state-of-the-art fast CA-SCL decoders with equivalent FER performance, the proposed decoders are up to $5.8\times$ more area-efficient. Finally, observations on energy dissipation indicate that Fast-DSCF is more energy-efficient than its CA-SCL-based counterparts.

preprint 2019 arXiv

Rate-Flexible Fast Polar Decoders

Polar codes have gained extensive attention during the past few years and recently they have been selected for the next generation of wireless communications standards (5G). Successive-cancellation-based (SC-based) decoders, such as SC list (SCL) and SC flip (SCF), provide a reasonable error performance for polar codes at the cost of low decoding speed. Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-SSCF, identify the special constituent codes in a polar code graph off-line, produce a list of operations, store the list in memory, and feed the list to the decoder to decode the constituent codes in order efficiently, thus increasing the decoding speed. However, the list of operations is dependent on the code rate and as the rate changes, a new list is produced, making fast SC-based decoders not rate-flexible. In this paper, we propose a completely rate-flexible fast SC-based decoder by creating the list of operations directly in hardware, with low implementation complexity. We further propose a hardware architecture implementing the proposed method and show that the area occupation of the rate-flexible fast SC-based decoder in this paper is only $38\%$ of the total area of the memory-based base-line decoder when 5G code rates are supported.

preprint 2016 arXiv

Fast Low-Complexity Decoders for Low-Rate Polar Codes

Polar codes are capacity-achieving error-correcting codes with an explicit construction that can be decoded with low-complexity algorithms. In this work, we show how the state-of-the-art low-complexity decoding algorithm can be improved to better accommodate low-rate codes. More constituent codes are recognized in the updated algorithm and dedicated hardware is added to efficiently decode these new constituent codes. We also alter the polar code construction to further decrease the latency and increase the throughput with little to no noticeable effect on error-correction performance. Rate-flexible decoders for polar codes of length 1024 and 2048 are implemented on FPGA. Over the previous work, they are shown to have from 22% to 28% lower latency and 26% to 34% greater throughput when decoding low-rate codes. On 65 nm ASIC CMOS technology, the proposed decoder for a (1024, 512) polar code is shown to compare favorably against the state-of-the-art ASIC decoders. With a clock frequency of 400 MHz and a supply voltage of 0.8 V, it has a latency of 0.41 $\mu$s and an area efficiency of 1.8 Gbps/mm$^2$ for an energy efficiency of 77 pJ/info. bit. At 600 MHz with a supply of 1 V, the latency is reduced to 0.27 $\mu$s and the area efficiency increased to 2.7 Gbps/mm$^2$ at 115 pJ/info. bit.

preprint 2016 arXiv

Flexible and Low-Complexity Encoding and Decoding of Systematic Polar Codes

In this work, we present hardware and software implementations of flexible polar systematic encoders and decoders. The proposed implementations operate on polar codes of any length less than a maximum and of any rate. We describe the low-complexity, highly parallel, and flexible systematic-encoding algorithm that we use and prove its correctness. Our hardware implementation results show that the overhead of adding code rate and length flexibility is little, and the impact on operation latency minor compared to code-specific versions. Finally, the flexible software encoder and decoder implementations are also shown to be able to maintain high throughput and low latency.

preprint 2016 arXiv

Hardware Decoders for Polar Codes: An Overview

Polar codes are an exciting new class of error correcting codes that achieve the symmetric capacity of memoryless channels. Many decoding algorithms were developed and implemented, addressing various application requirements: from error-correction performance rivaling that of LDPC codes to very high throughput or low-complexity decoders. In this work, we review the state of the art in polar decoders implementing the successive-cancellation, belief propagation, and list decoding algorithms, illustrating their advantages.

preprint 2016 arXiv

Low-Latency Software Polar Decoders

Polar codes are a new class of capacity-achieving error-correcting codes with low encoding and decoding complexity. Their low-complexity decoding algorithms render them attractive for use in software-defined radio applications where computational resources are limited. In this work, we present low-latency software polar decoders that exploit modern processor capabilities. We show how adapting the algorithm at various levels can lead to significant improvements in latency and throughput, yielding polar decoders that are suitable for high-performance software-defined radio applications on modern desktop processors and embedded-platform processors. These proposed decoders have an order of magnitude lower latency and memory footprint compared to state-of-the-art decoders, while maintaining comparable throughput. In addition, we present strategies and results for implementing polar decoders on graphical processing units. Finally, we show that the energy efficiency of the proposed decoders is comparable to state-of-the-art software polar decoders.

preprint 2016 arXiv

Multi-mode Unrolled Architectures for Polar Decoders

In this work, we present a family of architectures for polar decoders using a reduced-complexity successive-cancellation decoding algorithm that employs unrolling to achieve extremely high throughput values while retaining moderate implementation complexity. The resulting fully-unrolled, deeply-pipelined architecture is capable of achieving a coded throughput in excess of 1 Tbps on a 65 nm ASIC at 500 MHz, three orders of magnitude greater than current state-of-the-art polar decoders. However, unrolled decoders are built for a specific, fixed code. Therefore we also present a new method to enable the use of multiple code lengths and rates in a fully-unrolled polar decoder architecture. This method leads to a length- and rate-flexible decoder while retaining the very high speed typical of unrolled decoders. The resulting decoders can decode a master polar code of a given rate and length, and several shorter codes of different rates and lengths. We present results for two versions of a multi-mode decoder supporting eight and ten different polar codes, respectively. Both are capable of a peak throughput of 25.6 Gbps. For each decoder, the energy efficiency for the longest supported polar code is shown to be 14.8 pJ/bit at 250 MHz and 8.8 pJ/bit at 500 MHz.

preprint 2016 arXiv

Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization

Artificial neural networks have gone through a recent rise in popularity, achieving state-of-the-art results in various fields, including image classification, speech recognition, and automated control. Both the performance and computational complexity of such models are heavily dependent on the design of characteristic hyper-parameters (e.g., number of hidden layers, nodes per layer, or choice of activation functions), which have traditionally been optimized manually. With machine learning penetrating low-power mobile and embedded areas, the need to optimize not only for performance (accuracy), but also for implementation complexity, becomes paramount. In this work, we present a multi-objective design space exploration method that reduces the number of solution networks trained and evaluated through response surface modelling. Given spaces which can easily exceed $10^{20}$ solutions, manually designing a near-optimal architecture is unlikely as opportunities to reduce network complexity, while maintaining performance, may be overlooked. This problem is exacerbated by the fact that hyper-parameters which perform well on specific datasets may yield sub-par results on others, and must therefore be designed on a per-application basis. In our work, machine learning is leveraged by training an artificial neural network to predict the performance of future candidate networks. The method is evaluated on the MNIST and CIFAR-10 image datasets, optimizing for both recognition accuracy and computational complexity. Experimental results demonstrate that the proposed method can closely approximate the Pareto-optimal front, while only exploring a small fraction of the design space.
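The Pareto-optimal front referred to in the abstract can be made concrete with a small Python sketch. It covers only the final filtering step (keeping candidates that no other candidate beats on both objectives) and does not model the paper's response-surface predictor or search procedure. The candidate values and the names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

# Each candidate is (error rate, computational cost); both are minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both objectives and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated candidates, sorted by error rate."""
    return sorted(
        p for p in set(points) if not any(dominates(q, p) for q in points)
    )


# Hypothetical (error, cost) pairs for five candidate networks.
candidates = [
    (0.02, 900.0),
    (0.03, 400.0),
    (0.05, 450.0),   # dominated by (0.03, 400.0)
    (0.10, 100.0),
    (0.02, 1200.0),  # dominated by (0.02, 900.0)
]
front = pareto_front(candidates)
```

Here `front` contains the three candidates that represent distinct accuracy and complexity trade-offs; the quadratic scan is adequate for a handful of points but would need a smarter method for large candidate sets.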

preprint 2015 arXiv

Fast List Decoders for Polar Codes

Polar codes asymptotically achieve the symmetric capacity of memoryless channels, yet their error-correcting performance under successive-cancellation (SC) decoding for short and moderate length codes is worse than that of other modern codes such as low-density parity-check (LDPC) codes. Of the many methods to improve the error-correction performance of polar codes, list decoding yields the best results, especially when the polar code is concatenated with a cyclic redundancy check (CRC). List decoding involves exploring several decoding paths with SC decoding, and therefore tends to be slower than SC decoding itself, by an order of magnitude in practical implementations. In this paper, we present a new algorithm based on unrolling the decoding tree of the code that improves the speed of list decoding by an order of magnitude when implemented in software. Furthermore, we show that for software-defined radio applications, our proposed algorithm is faster than the fastest software implementations of LDPC decoders in the literature while offering comparable error-correction performance at similar or shorter code lengths.

preprint 2014 arXiv

A 237 Gbps Unrolled Hardware Polar Decoder

In this letter we present a new architecture for a polar decoder using a reduced complexity successive cancellation decoding algorithm. This novel fully-unrolled, deeply-pipelined architecture is capable of achieving a coded throughput of over 237 Gbps for a (1024,512) polar code implemented using an FPGA. This decoder is two orders of magnitude faster than state-of-the-art polar decoders.

preprint 2014 arXiv

Associative Memories Based on Multiple-Valued Sparse Clustered Networks

Associative memories are structures that store data patterns and retrieve them given partial inputs. Sparse Clustered Networks (SCNs) are recently-introduced binary-weighted associative memories that significantly improve the storage and retrieval capabilities over the prior state of the art. However, deleting or updating the data patterns results in a significant increase in the data retrieval error probability. In this paper, we propose an algorithm to address this problem by incorporating multiple-valued weights for the interconnections used in the network. The proposed algorithm lowers the error rate by an order of magnitude for our sample network with 60% deleted contents. We then investigate the advantages of the proposed algorithm for hardware implementations.

preprint 2014 arXiv

Fast Software Polar Decoders

Among error-correcting codes, polar codes are the first to provably achieve channel capacity with an explicit construction. In this work, we present software implementations of a polar decoder that leverage the capabilities of modern general-purpose processors to achieve an information throughput in excess of 200 Mbps, a throughput well suited for software-defined-radio applications. We also show that, for a similar error-correction performance, the throughput of polar decoders both surpasses that of LDPC decoders targeting general-purpose processors and is competitive with that of state-of-the-art software LDPC decoders running on graphic processing units.

preprint 2014 arXiv

In-Network Linear Regression with Arbitrarily Split Data Matrices

In this paper, we address the problem of how a network of agents can collaboratively fit a linear model when each agent only ever has an arbitrary summand of the regression data. This problem generalizes previously studied data-matrix-splitting scenarios, allowing for some agents to have more measurements of some features than of others and even have measurements that other agents have. We present a variable-centric framework for distributed optimization in a network, and use this framework to develop a proximal algorithm, based on the Douglas-Rachford method, that solves the problem.

preprint 2014 arXiv

Increasing the Speed of Polar List Decoders

In this work, we present a simplified successive cancellation list decoder that uses a Chase-like decoding process to achieve a six-fold improvement in speed compared to successive cancellation list decoding while maintaining the same error-correction performance advantage over standard successive-cancellation polar decoders. We discuss the algorithm and detail the data structures and methods used to obtain this speed-up. We also propose an adaptive decoding algorithm that significantly improves the throughput while retaining the error-correction performance. Simulation results over the additive white Gaussian noise channel are provided and show that the proposed system is up to 16 times faster than an LDPC decoder of the same frame size, code rate, and similar error-correction performance, making it more suitable for use as a software decoding solution.

preprint 2013 arXiv

A Low-Power Content-Addressable-Memory Based on Clustered-Sparse-Networks

A low-power Content-Addressable-Memory (CAM) is introduced employing a new mechanism for associativity between the input tags and the corresponding address of the output data. The proposed architecture is based on a recently developed clustered-sparse-network using binary-weighted connections that on average will eliminate most of the parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed design is significantly lower compared to that of a conventional low-power CAM design. Given an input tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the comparisons on them to locate a single valid match. A 0.13 µm CMOS technology was used for simulation purposes. The energy consumption and the search delay of the proposed design are 9.5% and 30.4% of those of the conventional NAND architecture, respectively, with a 3.4% higher number of transistors.

preprint 2013 arXiv

Fast Polar Decoders: Algorithm and Implementation

Polar codes provably achieve the symmetric capacity of a memoryless channel while having an explicit construction. This work aims to increase the throughput of polar decoder hardware by an order of magnitude relative to the state of the art successive-cancellation decoder. We present an algorithm, architecture, and FPGA implementation of a gigabit-per-second polar decoder.

preprint 2013 arXiv

Scalable Successive-Cancellation Hardware Decoder for Polar Codes

Polar codes, discovered by Arıkan, are the first error-correcting codes with an explicit construction to provably achieve channel capacity, asymptotically. However, their error-correction performance at finite lengths tends to be lower than existing capacity-approaching schemes. Using the successive-cancellation algorithm, polar decoders can be designed for very long codes, with low hardware complexity, leveraging the regular structure of such codes. We present an architecture and an implementation of a scalable hardware decoder based on this algorithm. This design is shown to scale to code lengths of up to N = 2^20 on an Altera Stratix IV FPGA, limited almost exclusively by the amount of available SRAM.

preprint 2013 arXiv

Selective Decoding in Associative Memories Based on Sparse-Clustered Networks

Associative memories are structures that can retrieve previously stored information given a partial input pattern instead of an explicit address as in indexed memories. A few hardware approaches have recently been introduced for a new family of associative memories based on Sparse-Clustered Networks (SCN) that show attractive features. These architectures are suitable for implementations with low retrieval latency, but are limited to small networks that store a few hundred data entries. In this paper, a new hardware architecture of SCNs is proposed that features a new data-storage technique as well as a method we refer to as Selective Decoding (SD-SCN). The SD-SCN has been implemented using an FPGA similar to the one used in the previous efforts and achieves two orders of magnitude higher capacity, with no error-performance penalty but at the cost of a few extra clock cycles per data access.

preprint 2012 arXiv

A Chernoff-type Lower Bound for the Gaussian Q-function

A lower bound for the Gaussian Q-function is presented in the form of a single exponential function with parametric order and weight. We prove the lower bound by introducing two functions, one related to the Q-function and the other similarly related to the exponential function, and by obtaining inequalities that indicate the sign of the difference of the two functions.

preprint 2012 arXiv

Relaxed Half-Stochastic Belief Propagation

Low-density parity-check codes are attractive for high-throughput applications not only because of their low decoding complexity per bit, but also because all the codeword bits can be decoded in parallel. However, achieving this in a circuit implementation is complicated by the number of wires required to exchange messages between processing nodes. Decoding algorithms that exchange binary messages are interesting for fully-parallel implementations because they can reduce the number and the length of the wires, and increase logic density. This paper introduces the Relaxed Half-Stochastic (RHS) decoding algorithm, a binary-message belief propagation (BP) algorithm that achieves a coding gain comparable to the best known BP algorithms that use real-valued messages. We derive the RHS algorithm by starting from the well-known Sum-Product algorithm, and then derive a low-complexity version suitable for circuit implementation. We present extensive simulation results on two standardized codes having different rates and constructions, including low bit error rate results. These simulations show that RHS can be an advantageous replacement for the existing state-of-the-art decoding algorithms when targeting fully-parallel implementations.

preprint 2011 arXiv

Hardware Implementation of Successive Cancellation Decoders for Polar Codes

The recently-discovered polar codes are seen as a major breakthrough in coding theory; they provably achieve the theoretical capacity of discrete memoryless channels using the low-complexity successive cancellation (SC) decoding algorithm. Motivated by recent developments in polar coding theory, we propose a family of efficient hardware implementations for SC polar decoders. We show that such decoders can be implemented with O(n) processing elements and O(n) memory elements, and can provide a constant throughput for a given target clock frequency. Furthermore, we show that SC decoding can be implemented in the logarithm domain, thereby eliminating costly multiplication and division operations and greatly reducing the complexity of each processing element. We also present a detailed architecture for an SC decoder and provide logic synthesis results confirming the linear growth in complexity of the decoder as the code length increases.

preprint 2010 arXiv

Hardware architectures for Successive Cancellation Decoding of Polar Codes

The recently-discovered polar codes are widely seen as a major breakthrough in coding theory. These codes achieve the capacity of many important channels under successive cancellation decoding. Motivated by the rapid progress in the theory of polar codes, we propose a family of architectures for efficient hardware implementation of successive cancellation decoders. We show that such decoders can be implemented with O(n) processing elements and O(n) memory elements, while providing constant throughput. We also propose a technique for overlapping the decoding of several consecutive codewords, thereby achieving a significant speed-up factor. We furthermore show that successive cancellation decoding can be implemented in the logarithmic domain, thereby eliminating the multiplication and division operations and greatly reducing the complexity of each processing element.
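As a rough illustration of what logarithmic-domain successive cancellation looks like, the sketch below decodes a length-8 polar code using log-likelihood ratios, so each node update needs only comparisons, additions, and subtractions. It is not taken from the paper: the min-sum form of the check-node update is a common hardware-friendly approximation assumed here, and the frozen set `{0, 1, 2, 4}`, the function names, and the noiseless test channel are all illustrative choices.

```python
def f_minsum(a, b):
    """Check-node update on two LLRs.

    Min-sum approximation of 2*atanh(tanh(a/2)*tanh(b/2)):
    sign(a)*sign(b)*min(|a|, |b|), computed with comparisons only.
    """
    magnitude = min(abs(a), abs(b))
    return magnitude if (a < 0) == (b < 0) else -magnitude


def g(a, b, u):
    """Variable-node update given partial-sum bit u: b + (-1)**u * a."""
    return b + a if u == 0 else b - a


def polar_encode(u):
    """Encode u with the n-fold Kronecker power of F = [[1,0],[1,1]] (natural order)."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    left = polar_encode(u[:half])
    right = polar_encode(u[half:])
    return [l ^ r for l, r in zip(left, right)] + right


def sc_decode(llr, frozen, offset=0):
    """Recursive SC decoding. Returns (u_hat, x_hat) for this sub-block.

    llr[i] > 0 means bit i is more likely 0. `offset` is the position of
    this sub-block's first bit within the full-length u vector.
    """
    n = len(llr)
    if n == 1:
        bit = 0 if (offset in frozen or llr[0] >= 0) else 1
        return [bit], [bit]
    half = n // 2
    # Left half: LLRs of the XOR of paired codeword bits.
    left_llr = [f_minsum(llr[i], llr[i + half]) for i in range(half)]
    u_left, x_left = sc_decode(left_llr, frozen, offset)
    # Right half: combine both observations using the left partial sums.
    right_llr = [g(llr[i], llr[i + half], x_left[i]) for i in range(half)]
    u_right, x_right = sc_decode(right_llr, frozen, offset + half)
    x_hat = [l ^ r for l, r in zip(x_left, x_right)] + x_right
    return u_left + u_right, x_hat


# Assumed (8, 4) code: bits 0, 1, 2, 4 are frozen to zero.
FROZEN = {0, 1, 2, 4}
INFO = [3, 5, 6, 7]


def roundtrip(message):
    """Encode 4 information bits, pass them through a noiseless channel, decode."""
    u = [0] * 8
    for idx, bit in zip(INFO, message):
        u[idx] = bit
    x = polar_encode(u)
    llr = [4.0 if bit == 0 else -4.0 for bit in x]  # noiseless LLRs
    u_hat, _ = sc_decode(llr, FROZEN)
    return [u_hat[i] for i in INFO]
```

Calling `roundtrip([1, 0, 1, 1])` recovers the same four bits on this noiseless channel. The recursion is written for clarity; a hardware decoder would instead schedule these `f` and `g` updates across a fixed set of processing elements.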

31 published item(s)

Efficient Fine-Tuning of BERT Models on the Edge

Efficient Fine-Tuning of Compressed Language Models with Learners

Fast Successive-Cancellation List Flip Decoding of Polar Codes

High-Throughput and Energy-Efficient VLSI Architecture for Ordered Reliability Bits GRAND

Standard Deviation-Based Quantization for Deep Neural Networks

Successive-Cancellation Decoding of Reed-Muller Codes with Fast Hadamard Transform

Fast Thresholded SC-Flip Decoding of Polar Codes

High-Throughput VLSI Architecture for GRAND

Operation Merging for Hardware Implementations of Fast Polar Decoders

Practical Dynamic SC-Flip Polar Decoders: Algorithm and Implementation

Rate-Flexible Fast Polar Decoders

Fast Low-Complexity Decoders for Low-Rate Polar Codes

Flexible and Low-Complexity Encoding and Decoding of Systematic Polar Codes

Hardware Decoders for Polar Codes: An Overview

Low-Latency Software Polar Decoders

Multi-mode Unrolled Architectures for Polar Decoders

Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization

Fast List Decoders for Polar Codes

A 237 Gbps Unrolled Hardware Polar Decoder

Associative Memories Based on Multiple-Valued Sparse Clustered Networks

Fast Software Polar Decoders

In-Network Linear Regression with Arbitrarily Split Data Matrices

Increasing the Speed of Polar List Decoders

A Low-Power Content-Addressable-Memory Based on Clustered-Sparse-Networks

Fast Polar Decoders: Algorithm and Implementation

Scalable Successive-Cancellation Hardware Decoder for Polar Codes

Selective Decoding in Associative Memories Based on Sparse-Clustered Networks

A Chernoff-type Lower Bound for the Gaussian Q-function

Relaxed Half-Stochastic Belief Propagation

Hardware Implementation of Successive Cancellation Decoders for Polar Codes

Hardware architectures for Successive Cancellation Decoding of Polar Codes