Researcher profile

Warren J. Gross

Warren J. Gross contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
11works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

11 published item(s)

preprint2022arXiv

Efficient Fine-Tuning of BERT Models on the Edge

Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, comes increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.

preprint2022arXiv

Efficient Fine-Tuning of Compressed Language Models with Learners

Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and have significantly lower resource utilization.

preprint2022arXiv

Fast Successive-Cancellation List Flip Decoding of Polar Codes

This work presents a fast successive-cancellation list flip (Fast-SCLF) decoding algorithm for polar codes that addresses the high latency issue associated with the successive-cancellation list flip (SCLF) decoding algorithm. We first propose a bit-flipping strategy tailored to the state-of-the-art fast successive-cancellation list (FSCL) decoding that avoids tree-traversal in the binary tree representation of SCLF, thus reducing the latency of the decoding process. We then derive a parameterized path selection error model to accurately estimate the bit index at which the correct decoding path is eliminated from the initial FSCL decoding. The trainable parameter is optimized online based on an efficient supervised learning framework. Simulation results show that for a polar code of length 512 with 256 information bits, with similar error-correction performance and memory consumption, the proposed Fast-SCLF decoder reduces up to $73.4\%$ of the average decoding latency of the SCLF decoder with the same list size at the frame error rate of $10^{-4}$, while incurring a maximum computational complexity overhead of $27.6\%$. For the same polar code of length 512 with 256 information bits and at practical signal-to-noise ratios, the proposed decoder with list size 4 reduces $89.3\%$ and $43.7\%$ of the average complexity and decoding latency of the FSCL decoder with list size 32 (FSCL-32), respectively, while also reducing $83.2\%$ of the memory consumption of FSCL-32. The significant improvements of the proposed decoder come at the cost of $0.07$ dB error-correction performance degradation compared with FSCL-32.

preprint2022arXiv

High-Throughput and Energy-Efficient VLSI Architecture for Ordered Reliability Bits GRAND

Ultra-reliable low-latency communication (URLLC), a major 5G New-Radio use case, is the key enabler for applications with strict reliability and latency requirements. These applications necessitate the use of short-length and high-rate codes. Guessing Random Additive Noise Decoding (GRAND) is a recently proposed Maximum Likelihood (ML) decoding technique for these short-length and high-rate codes. Rather than decoding the received vector, GRAND tries to infer the noise that corrupted the transmitted codeword during transmission through the communication channel. As a result, GRAND can decode any code, structured or unstructured. GRAND has hard-input as well as soft-input variants. Among these variants, Ordered Reliability Bits GRAND (ORBGRAND) is a soft-input variant that outperforms hard-input GRAND and is suitable for parallel hardware implementation. This work reports the first hardware architecture for ORBGRAND, which achieves an average throughput of up to $42.5$ Gbps for a code length of $128$ at a target FER of $10^{-7}$. Furthermore, the proposed hardware can be used to decode any code as long as the length and rate constraints are met. In comparison to the GRANDAB, a hard-input variant of GRAND, the proposed architecture enhances decoding performance by at least $2$ dB. When compared to the state-of-the-art fast dynamic successive cancellation flip decoder (Fast-DSCF) using a 5G polar $(128,105)$ code, the proposed ORBGRAND VLSI implementation has $49\times$ higher average throughput, $32\times$ times more energy efficiency, and $5\times$ more area efficiency while maintaining similar decoding performance.

preprint2022arXiv

Standard Deviation-Based Quantization for Deep Neural Networks

Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.

preprint2022arXiv

Successive-Cancellation Decoding of Reed-Muller Codes with Fast Hadamard Transform

A novel permuted fast successive-cancellation list decoding algorithm with fast Hadamard transform (FHT-FSCL) is presented. The proposed decoder initializes $L$ $(L\ge1)$ active decoding paths with $L$ random codeword permutations sampled from the full symmetry group of the codes. The path extension in the permutation domain is carried out until the first constituent RM code of order $1$ is visited. Conventional path extension of the successive-cancellation list decoder is then utilized in the information bit domain. The simulation results show that for a RM code of length $512$ with $46$ information bits, by running $20$ parallel permuted FHT-FSCL decoders with $L=4$, we reduce $72\%$ of the computational complexity, $22\%$ of the decoding latency, and $84\%$ of the memory consumption of the state-of-the-art simplified successive-cancellation decoder that uses $512$ permutations sampled from the full symmetry group of the code, with similar error-correction performance at the target frame error rate of $10^{-4}$.

preprint2020arXiv

Fast Thresholded SC-Flip Decoding of Polar Codes

SC-Flip (SCF) decoding algorithm shares the attention with the common polar code decoding approaches due to its low-complexity and improved error-correction performance. However, the inefficient criterion for locating the correct bit-flipping position in SCF decoding limits its improvements. Due to its improved bit-flipping criterion, Thresholded SCF (TSCF) decoding algorithm exhibits a superior error-correction performance and lower computational complexity than SCF decoding. However, the parameters of TSCF decoding depend on multiple channel and code parameters, and are obtained via Monte-Carlo simulations. Our main goal is to realize TSCF decoding as a practical polar decoder implementation. To this end, we first realize an approximated threshold value that is independent of the code parameters and precomputations. The proposed approximation has negligible error-correction performance degradation on the TSCF decoding. Then, we validate an alternative approach for forming a critical set that does not require precomputations, which also paves the way to the implementation of the Fast-TSCF decoder. Compared to the existing fast SCF implementations, the proposed Fast-TSCF decoder has $0.24$ to $0.41$ dB performance gain at frame error rate of $10^{-3}$, without any extra cost. Compared to the TSCF decoding, Fast-TSCF does not depend on precomputations and requires $87\%$ fewer decoding steps. Finally, implementation results in TSMC 65nm CMOS technology show that the Fast-TSCF decoder is $20\%$ and $82\%$ more area-efficient than the state-of-the-art fast SCF and fast SC-List decoder architectures, respectively.

preprint2020arXiv

High-Throughput VLSI Architecture for GRAND

Guessing Random Additive Noise Decoding (GRAND) is a recently proposed universal decoding algorithm for linear error correcting codes. Since GRAND does not depend on the structure of the code, it can be used for any code encountered in contemporary communication standards or may even be used for random linear network coding. This property makes this new algorithm particularly appealing. Instead of trying to decode the received vector, GRAND attempts to identify the noise that corrupted the codeword. To that end, GRAND relies on the generation of test error patterns that are successively applied to the received vector. In this paper, we propose the first hardware architecture for the GRAND algorithm. Considering GRAND with ABandonment (GRANDAB) that limits the number of test patterns, the proposed architecture only needs $2+\sum_{i=2}^{n} \left\lfloor\frac{i}{2}\right\rfloor$ time steps to perform the $\sum_{i=1}^3 \binom{n}{i}$ queries required when $\text{AB}=3$. For a code length of $128$, our proposed hardware architecture demonstrates only a fraction ($1.2\%$) of the total number of performed queries as time steps. Synthesis result using TSMC 65nm CMOS technology shows that average throughputs of $32$ Gbps to $64$ Gbps can be achieved at an SNR of $10$ dB for a code length of $128$ and code rates rate higher than $0.75$, transmitted over an AWGN channel. Comparisons with a decoder tailored for a $(79,64)$ BCH code show that the proposed architecture can achieve a slightly higher average throughput at high SNRs, while obtaining the same decoding performance.

preprint2020arXiv

Operation Merging for Hardware Implementations of Fast Polar Decoders

Polar codes are a class of linear block codes that provably achieves channel capacity. They have been selected as a coding scheme for the control channel of enhanced mobile broadband (eMBB) scenario for $5^{\text{th}}$ generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works targeting improved throughput for successive-cancellation (SC) decoding of polar codes are semi-parallel implementations that exploit special maximum-likelihood (ML) nodes. In this work, we present a new fast simplified SC (Fast-SSC) decoder architecture. Compared to a baseline Fast-SSC decoder, our solution is able to reduce the memory requirements. We achieve this through a more efficient memory utilization, which also enables to execute multiple operations in a single clock cycle. Finally, we propose new special node merging techniques that improve the throughput further, and detail a new Fast-SSC-based decoder architecture to support merged operations. The proposed decoder reduces the operation sequence requirement by up to $39\%$, which enables to reduce the number of time steps to decode a codeword by $35\%$. ASIC implementation results with 65 nm TSMC technology show that the proposed decoder has a throughput improvement of up to $31\%$ compared to previous Fast-SSC decoder architectures.

preprint2020arXiv

Practical Dynamic SC-Flip Polar Decoders: Algorithm and Implementation

SC-Flip (SCF) is a low-complexity polar code decoding algorithm with improved performance, and is an alternative to high-complexity (CRC)-aided SC-List (CA-SCL) decoding. However, the performance improvement of SCF is limited since it can correct up to only one channel error ($ω=1$). Dynamic SCF (DSCF) algorithm tackles this problem by tackling multiple errors ($ω\geq 1$), but it requires logarithmic and exponential computations, which make it infeasible for practical applications. In this work, we propose simplifications and approximations to make DSCF practically feasible. First, we reduce the transcendental computations of DSCF decoding to a constant approximation. Then, we show how to incorporate special node decoding techniques into DSCF algorithm, creating the Fast-DSCF decoding. Next, we reduce the search span within the special nodes to further reduce the computational complexity. Following, we describe a hardware architecture for the Fast-DSCF decoder, in which we introduce additional simplifications such as metric normalization and sorter length reduction. All the simplifications and approximations are shown to have minimal impact on the error-correction performance, and the reported Fast-DSCF decoder is the only SCF-based architecture that can correct multiple errors. The Fast-DSCF decoders synthesized using TSMC $65$nm CMOS technology can achieve a $1.25$, $1.06$ and $0.93$ Gbps throughput for $ω\in \{1,2,3\}$, respectively. Compared to the state-of-the-art fast CA-SCL decoders with equivalent FER performance, the proposed decoders are up to $5.8\times$ more area-efficient. Finally, observations at energy dissipation indicate that the Fast-DSCF is more energy-efficient than its CA-SCL-based counterparts.

preprint2019arXiv

Rate-Flexible Fast Polar Decoders

Polar codes have gained extensive attention during the past few years and recently they have been selected for the next generation of wireless communications standards (5G). Successive-cancellation-based (SC-based) decoders, such as SC list (SCL) and SC flip (SCF), provide a reasonable error performance for polar codes at the cost of low decoding speed. Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-SSCF, identify the special constituent codes in a polar code graph off-line, produce a list of operations, store the list in memory, and feed the list to the decoder to decode the constituent codes in order efficiently, thus increasing the decoding speed. However, the list of operations is dependent on the code rate and as the rate changes, a new list is produced, making fast SC-based decoders not rate-flexible. In this paper, we propose a completely rate-flexible fast SC-based decoder by creating the list of operations directly in hardware, with low implementation complexity. We further propose a hardware architecture implementing the proposed method and show that the area occupation of the rate-flexible fast SC-based decoder in this paper is only $38\%$ of the total area of the memory-based base-line decoder when 5G code rates are supported.