Topic overview

Hardware Architecture

791 works, 2988 researchers, 0 institutions

Papers in this area

24 featured work(s)

preprint, 2018, arXiv

CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse

Deep Neural Networks (DNNs) have been widely deployed for many Machine Learning applications. Recently, CapsuleNets have overtaken traditional DNNs, because of their improved generalization ability due to the multi-dimensional capsules, in contrast to the single-dimensional neurons. Consequently, CapsuleNets also require extremely intense matrix computations, making it a gigantic challenge to achieve high performance. In this paper, we propose CapsAcc, the first specialized CMOS-based hardware architecture to perform CapsuleNets inference with high performance and energy efficiency. State-of-the-art convolutional DNN accelerators would not work efficiently for CapsuleNets, as their designs do not account for key operations involved in CapsuleNets, like squashing and dynamic routing, as well as multi-dimensional matrix processing. Our CapsAcc architecture targets this problem and achieves significant improvements, when compared to an optimized GPU implementation. Our architecture exploits the massive parallelism by flexibly feeding the data to a specialized systolic array according to the operations required in different layers. It also avoids extensive load and store operations on the on-chip memory, by reusing the data when possible. We further optimize the routing algorithm to reduce the computations needed at this stage. We synthesized the complete CapsAcc architecture in a 32nm CMOS technology using Synopsys design tools, and evaluated it for the MNIST benchmark (as also done by the original CapsuleNet paper) to ensure consistent and fair comparisons. This work enables highly-efficient CapsuleNets inference on embedded platforms.

preprint, 2019, arXiv

Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads

Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads. This paper proposes Touché, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touché framework consists of three components. The first component, called the "Signature" (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the "Tag Appended Data" (TADA) mechanism, stores the full tag addresses with data. TADA enables Touché to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the "Superblock Marker" (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is completely hardware-based and achieves an average speedup of 12% (ideal 13%) when compared to an uncompressed baseline.
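The signature-then-verify lookup described in this abstract can be sketched roughly as follows. This is only an illustration: the 8-bit signature width, the XOR-folding hash, and the `CacheLine` class are assumptions made for the example and are not taken from the paper.

```python
SIG_BITS = 8  # assumed signature width; the abstract does not state one


def signature(tag: int, bits: int = SIG_BITS) -> int:
    """XOR-fold a tag address down to `bits` bits (hypothetical hash)."""
    mask = (1 << bits) - 1
    sig = 0
    while tag:
        sig ^= tag & mask
        tag >>= bits
    return sig


class CacheLine:
    """One physical cacheline holding several compressed blocks (illustrative)."""

    def __init__(self):
        self.signatures = []  # short signatures kept in the tag entry (SIGN)
        self.blocks = []      # (full_tag, data) pairs kept with the data (TADA)

    def insert(self, tag: int, data):
        self.signatures.append(signature(tag))
        self.blocks.append((tag, data))

    def lookup(self, tag: int):
        sig = signature(tag)
        for i, stored in enumerate(self.signatures):
            if stored != sig:
                continue  # no signature match: the data array is not read
            full_tag, data = self.blocks[i]
            if full_tag == tag:  # full-tag comparison filters false positives
                return data
        return None


line = CacheLine()
line.insert(0x1234, "block A")
line.insert(0xABCD, "block B")
print(line.lookup(0x1234))  # hit
print(line.lookup(0x0026))  # same signature as 0x1234, rejected by the full tag
```

The second lookup shows why the full tag has to travel with the data: a shortened signature alone cannot distinguish two tags that fold to the same value.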

preprint, 2020, arXiv

A Novel Method for Scalable VLSI Implementation of Hyperbolic Tangent Function

Hyperbolic tangent and sigmoid functions are used as non-linear activation units in artificial and deep neural networks. Since these networks are computationally expensive, customized accelerators are designed to achieve the required performance at lower cost and power. The activation function and MAC units are the key building blocks of these neural networks. A low-complexity and accurate hardware implementation of the activation function is required to meet the performance and area targets of such neural network accelerators. Moreover, a scalable implementation is required, as recent studies show that DNNs may use different precision in different layers. This paper presents a novel method for hardware implementation, based on trigonometric expansion properties of the hyperbolic function, which can be easily tuned for different accuracy and precision requirements.

preprint, 2020, arXiv

A Systematic Study of Lattice-based NIST PQC Algorithms: from Reference Implementations to Hardware Accelerators

The security of currently deployed public-key cryptography algorithms is foreseen to be vulnerable to quantum computer attacks. Hence, a community effort exists to develop post-quantum cryptography (PQC) algorithms, i.e., algorithms that are resistant to quantum attacks. In this work, we have investigated how lattice-based candidate algorithms from the NIST PQC standardization competition fare when conceived as hardware accelerators. To achieve this, we have assessed the reference implementations of selected algorithms with the goal of identifying their basic building blocks. We assume the hardware accelerators will be implemented as application-specific integrated circuits (ASICs), and the targeted technology in our experiments is a commercial 65nm node. In order to estimate the characteristics of each algorithm, we have assessed their memory requirements, use of multipliers, and how each algorithm employs hashing functions. Furthermore, for these building blocks, we have collected area and power figures for 12 candidate algorithms. For memories, we make use of a commercial memory compiler. For logic, we make use of a standard cell library. In order to compare the candidate algorithms fairly, we select a reference frequency of operation of 500MHz. Our results reveal that our area and power numbers are comparable to the state of the art, despite targeting a higher frequency of operation and a higher security level in our experiments. The comprehensive investigation of lattice-based NIST PQC algorithms performed in this paper can be used for guiding ASIC designers when selecting an appropriate algorithm while respecting requirements and design constraints.

preprint, 2020, arXiv

On the Impact of Partial Sums on Interconnect Bandwidth and Memory Accesses in a DNN Accelerator

Dedicated accelerators are being designed to address the huge resource requirements of deep neural network (DNN) applications. Power, performance and area (PPA) constraints limit the number of MACs available in these accelerators. The convolution layers, which require a huge number of MACs, are often partitioned into multiple iterative sub-tasks. This puts huge pressure on the available system resources such as interconnect and memory bandwidth. The optimal partitioning of the feature maps for these sub-tasks can reduce the bandwidth requirement substantially. Some accelerators avoid off-chip or interconnect transfers by implementing local memories; however, the memory accesses are still performed, and a reduced bandwidth can help save power in such architectures. In this paper, we propose a first-order analytical method to partition the feature maps for optimal bandwidth and evaluate the impact of such partitioning on the bandwidth. This bandwidth can be saved by designing an active memory controller which can perform basic arithmetic operations. It is shown that the optimal partitioning and active memory controller can achieve up to 40% bandwidth reduction.

preprint, 2020, arXiv

Persistence and Synchronization: Friends or Foes?

Emerging non-volatile memory (NVM) technologies promise memory-speed, byte-addressable persistent storage with a load/store interface. However, programming applications to directly manipulate NVM data is complex and error-prone. Applications generally employ libraries that hide the low-level details of the hardware and provide a transactional programming model to achieve crash-consistency. Furthermore, applications continue to expect correctness during concurrent executions, achieved through the use of synchronization. To achieve this, applications seek well-known ACID guarantees. However, realizing this presents designers of transactional systems with a range of choices in how to combine several low-level techniques, given target hardware features and workload characteristics. In this paper, we provide a comprehensive evaluation of the impact of combining existing crash-consistency and synchronization methods for achieving performant and correct NVM transactional systems. We consider different hardware characteristics, in terms of support for hardware transactional memory (HTM) and the boundaries of the persistence domain (transient or persistent caches). By characterizing persistent transactional systems in terms of their properties, we make it possible to better understand the tradeoffs of different implementations and to arrive at better design choices for providing ACID guarantees. We use both real hardware with Intel Optane DC persistent memory and simulation to evaluate a persistent version of hardware transactional memory, a persistent version of software transactional memory, and undo/redo logging. Through our empirical study, we show two major factors that impact the cost of supporting persistence in transactional systems: the persistence domain (transient or persistent caches) and application characteristics, such as transaction size and parallelism.

preprint, 2020, arXiv

Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling

Modern computing devices employ High-Bandwidth Memory (HBM) to meet their memory bandwidth requirements. An HBM-enabled device consists of multiple DRAM layers stacked on top of one another next to a compute chip (e.g. CPU, GPU, and FPGA) in the same package. Although such HBM structures provide high bandwidth at a small form factor, the stacked memory layers consume a substantial portion of the package's power budget. Therefore, power-saving techniques that preserve the performance of HBM are desirable. Undervolting is one such technique: it reduces the supply voltage to decrease power consumption without reducing the device's operating frequency, thereby avoiding performance loss. Undervolting takes advantage of voltage guardbands put in place by manufacturers to ensure correct operation under all environmental conditions. However, reducing voltage without changing frequency can lead to reliability issues manifested as unwanted bit flips. In this paper, we provide the first experimental study of real HBM chips under reduced-voltage conditions. We show that the guardband regions for our HBM chips constitute 19% of the nominal voltage. Pushing the supply voltage down within the guardband region reduces power consumption by a factor of 1.5X for all bandwidth utilization rates. Pushing the voltage down further by 11% leads to a total of 2.3X power savings at the cost of unwanted bit flips. We explore and characterize the rate and types of these reduced-voltage-induced bit flips and present a fault map that enables the possibility of a three-factor trade-off among power, memory capacity, and fault rate.
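As a rough cross-check of the reported figures, the sketch below applies the textbook first-order model in which dynamic power scales with the square of the supply voltage at fixed frequency. The model is an assumption brought in for illustration, not the paper's measurement method, and the "further 11%" is read here as 11% of the nominal voltage, which is also an assumption.

```python
def dynamic_power_saving(voltage_fraction: float) -> float:
    """Power-saving factor if power scales as V^2 at fixed frequency.

    `voltage_fraction` is the reduced supply voltage divided by nominal.
    This ignores static power and any non-quadratic effects.
    """
    return 1.0 / voltage_fraction ** 2


# End of the 19% guardband: supply at 81% of nominal.
guardband = dynamic_power_saving(1.0 - 0.19)

# A further 11% of nominal below the guardband (assumed reading): 70% of nominal.
beyond = dynamic_power_saving(1.0 - 0.19 - 0.11)

print(f"guardband edge: {guardband:.2f}x, below guardband: {beyond:.2f}x")
```

Under this model the guardband edge gives about 1.52x, in line with the reported 1.5X. Below the guardband the model gives about 2.04x, lower than the measured 2.3X, so the quadratic approximation should be treated only as a sanity check.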

preprint, 2020, arXiv

Data Criticality in Multi-Threaded Applications: An Insight for Many-Core Systems

Multi-threaded applications are capable of exploiting the full potential of many-core systems. However, Network-on-Chip (NoC) based inter-core communication in many-core systems is responsible for 60-75% of the miss latency experienced by multi-threaded applications. Delay in the arrival of critical data at the requesting core severely hampers performance. This brief presents some interesting insights about how critical data is requested from the memory by multi-threaded applications. Then it investigates the cause of delay in NoC and how it affects the performance. Finally, this brief shows how NoC-aware memory access optimisations can significantly improve performance. Our experimental evaluation considers early restart memory access optimisation and demonstrates that by exploiting NoC resources, critical data can be prioritised to reduce miss penalty by 10-12% and improve system performance by 7-11%.

preprint, 2021, arXiv

Design of a Dynamic Parameter-Controlled Chaotic-PRNG in a 65nm CMOS process

In this paper, we present the design of a new chaotic map circuit in a 65nm CMOS process. This chaotic map circuit uses a dynamic parameter-control topology and generates a wide chaotic range. We propose two designs of dynamic parameter-controlled chaotic map (DPCCM)-based pseudo-random number generators (PRNGs). The randomness of the generated sequence is verified using three different statistical tests, namely, the NIST SP 800-22 test, the FIPS PUB 140-2 test, and the Diehard test. Our first design offers a throughput of 200 MS/s with an on-chip area of 0.024 mm² and a power consumption of 2.33 mW. The throughput of our second design is 300 MS/s with an area consumption of 0.132 mm² and a power consumption of 2.14 mW. The wider chaotic range and lower overhead offered by our designs make them highly suitable for various applications such as logic obfuscation, chaos-based cryptography, reconfigurable random number generation, and hardware security for resource-constrained edge devices like IoT.

preprint, 2021, arXiv

DB4HLS: A Database of High-Level Synthesis Design Space Explorations

High-Level Synthesis (HLS) frameworks make it easy to specify a large number of variants of the same hardware design by acting only on optimization directives. Nonetheless, the hardware synthesis of implementations for all possible combinations of directive values is impractical even for simple designs. Addressing this shortcoming, many HLS Design Space Exploration (DSE) strategies have been proposed to devise directive settings leading to high-quality implementations while limiting the number of synthesis runs. All these works require considerable effort to validate the proposed strategies and/or to build the knowledge base employed to tune abstract models, as both tasks mandate the synthesis of large collections of implementations. Currently, such data gathering is performed ad hoc, a) leading to a lack of standardization, hampering comparisons between DSE alternatives, and b) placing a very high burden on researchers willing to develop novel DSE strategies. Against this backdrop, we here introduce DB4HLS, a database of exhaustive HLS explorations comprising more than 100000 design points collected over 4 years of synthesis time. The open structure of DB4HLS allows the incremental integration of new DSEs, which can be easily defined with a dedicated domain-specific language. We think that our database, available at https://www.db4hls.inf.usi.ch/, will be a valuable tool for the research community investigating automated strategies for the optimization of HLS-based hardware designs.

preprint, 2021, arXiv

Robust Machine Learning Systems: Challenges, Current Trends, Perspectives, and the Road Ahead

Machine Learning (ML) techniques have been rapidly adopted by smart Cyber-Physical Systems (CPS) and Internet-of-Things (IoT) due to their powerful decision-making capabilities. However, they are vulnerable to various security and reliability threats, at both hardware and software levels, that compromise their accuracy. These threats get aggravated in emerging edge ML devices that have stringent constraints in terms of resources (e.g., compute, memory, power/energy), and that therefore cannot employ costly security and reliability measures. Security, reliability, and vulnerability mitigation techniques span from network security measures to hardware protection, with an increased interest towards formal verification of trained ML models. This paper summarizes the prominent vulnerabilities of modern ML systems, highlights successful defenses and mitigation techniques against these vulnerabilities, both at the cloud (i.e., during the ML training phase) and edge (i.e., during the ML inference stage), discusses the implications of a resource-constrained design on the reliability and security of the system, identifies verification methodologies to ensure correct system behavior, and describes open research challenges for building secure and reliable ML systems at both the edge and the cloud.

preprint, 2021, arXiv

Silicon Photonic Microring Based Chip-Scale Accelerator for Delayed Feedback Reservoir Computing

To perform temporal and sequential machine learning tasks, the use of conventional Recurrent Neural Networks (RNNs) has been dwindling due to the training complexities of RNNs. To this end, accelerators for delayed feedback reservoir computing (DFRC) have attracted attention in lieu of RNNs, due to their simple hardware implementations. A typical implementation of a DFRC accelerator consists of a delay loop and a single nonlinear neuron, together acting as multiple virtual nodes for computing. In prior work, photonic DFRC accelerators have shown an undisputed advantage of fast computation over their electronic counterparts. In this paper, we propose a more energy-efficient chip-scale DFRC accelerator that employs a silicon photonic microring (MR) based nonlinear neuron along with an on-chip photonic waveguide-based delayed feedback loop. Our evaluations show that, compared to a well-known photonic DFRC accelerator from prior work, our proposed MR-based DFRC accelerator achieves 35% and 98.7% lower normalized root mean square error (NRMSE), respectively, for the prediction tasks of NARMA10 and Santa Fe time series. In addition, our MR-based DFRC accelerator achieves 58.8% lower symbol error rate (SER) for the Non-Linear Channel Equalization task. Moreover, our MR-based DFRC accelerator has 98% and 93% faster training time, respectively, compared to an electronic and a photonic DFRC accelerator from prior work.

preprint, 2021, arXiv

An Investigation on Inherent Robustness of Posit Data Representation

As the dimensions and operating voltages of computer electronics shrink to cope with consumers' demand for higher performance and lower power consumption, circuit sensitivity to soft errors increases dramatically. Recently, a new data type called posit has been proposed in the literature. Posit arithmetic has absolute advantages such as higher numerical accuracy, speed, and simpler hardware design than IEEE 754-2008 technical standard-compliant arithmetic. In this paper, we propose a comparative robustness study between 32-bit posit and 32-bit IEEE 754-2008 compliant representations. First, we propose a theoretical analysis for IEEE 754 compliant numbers and posit numbers for single bit flips and double bit flips. Then, we conduct exhaustive fault injection experiments that show a considerable inherent resilience in the posit format compared to the classical IEEE 754 compliant representation. To show a relevant use case of fault-tolerant applications, we perform experiments on a set of machine-learning applications. In more than 95% of the exhaustive fault injection exploration, the posit representation is less impacted by faults than the IEEE 754 compliant floating-point representation. Moreover, in 100% of the tested machine-learning applications, the accuracy of posit-implemented systems is higher than that of the classical floating-point-based ones.
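To make the single-bit-flip experiment concrete, here is a minimal sketch of fault injection on the IEEE 754 side only. The helper name `flip_bit_f32` is hypothetical, and a posit decoder is deliberately left out, so this does not reproduce the paper's comparison; it only shows how strongly the impact of a flip depends on the bit position in a 32-bit float.

```python
import struct


def flip_bit_f32(value: float, bit: int) -> float:
    """Flip one bit of the 32-bit IEEE 754 encoding of `value`.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


for bit in (0, 23, 30, 31):
    print(bit, flip_bit_f32(1.0, bit))
```

For the value 1.0, a flip in the lowest mantissa bit changes the result by 2^-23, a flip in the lowest exponent bit halves it, a flip in the highest exponent bit turns it into infinity, and a flip in the sign bit negates it.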

preprint, 2021, arXiv

Best CNTFET Ternary Adders?

The MUX implementation of ternary half adders and full adders using predecessor and successor functions leads to the most efficient implementation, using the smallest transistor count. These designs are compared with the binary implementation of the corresponding half adders and full adders using the MUX technique or the typical complementary CMOS circuit style. The transistor count ratio between ternary and binary implementations is always greater than the information ratio ($\log_2(3)/\log_2(2) = 1.585$) between ternary and binary wires.
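The information ratio quoted above follows directly from the number of bits one ternary digit carries. The short sketch below computes it and, as an illustration, the number of ternary wires needed to cover a given binary word width; the function names are chosen for this example only.

```python
import math


def information_ratio(radix: int, reference_radix: int = 2) -> float:
    """Information per wire in `radix` relative to `reference_radix`."""
    return math.log2(radix) / math.log2(reference_radix)


def trits_needed(bits: int) -> int:
    """Smallest number of ternary digits that can represent 2**bits values."""
    trits, capacity = 0, 1
    while capacity < 2 ** bits:
        capacity *= 3
        trits += 1
    return trits


print(round(information_ratio(3), 3))  # 1.585
print(trits_needed(32))                # 21
```

A 32-bit binary word therefore fits in 21 ternary digits, so a ternary adder is only a net win if its per-digit transistor cost stays below roughly 1.585 times that of the binary one, which is the comparison the abstract draws.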

preprint, 2021, arXiv

Advancing Computing's Foundation of US Industry & Society

While past information technology (IT) advances have transformed society, future advances hold even greater promise. For example, we have only just begun to reap the changes from artificial intelligence (AI), especially machine learning (ML). Underlying IT's impact are the dramatic improvements in computer hardware, which deliver performance that unlocks new capabilities. For example, recent successes in AI/ML required the synergy of improved algorithms and hardware architectures (e.g., general-purpose graphics processing units). However, unlike in the 20th Century and early 2000s, tomorrow's performance aspirations must be achieved without continued semiconductor scaling formerly provided by Moore's Law and Dennard Scaling. How will one deliver the next 100x improvement in capability at similar or less cost to enable great value? Can we make the next AI leap without 100x better hardware? This whitepaper argues for a multipronged effort to develop new computing approaches beyond Moore's Law to advance the foundation that computing provides to US industry, education, medicine, science, and government. This impact extends far beyond the IT industry itself, as IT is now central for providing value across society, for example in semi-autonomous vehicles, tele-education, health wearables, viral analysis, and efficient administration. Herein we draw upon considerable visioning work by CRA's Computing Community Consortium (CCC) and the IEEE Rebooting Computing Initiative (IEEE RCI), enabled by thought leader input from industry, academia, and the US government.

preprint, 2021, arXiv

A Framework for Fast Scalable BNN Inference using Googlenet and Transfer Learning

Efficient and accurate object detection in video and image analysis is one of the major beneficiaries of the advancement in computer vision systems with the help of deep learning. With the aid of deep learning, more powerful tools have evolved that are capable of learning high-level and deeper features and can thus overcome the existing problems in traditional architectures of object detection algorithms. The work in this thesis aims to achieve high accuracy in object detection with good real-time performance. In the area of computer vision, a lot of research is going into the detection and processing of visual information by improving the existing algorithms. The binarized neural network has shown high performance in various vision tasks such as image classification, object detection, and semantic segmentation. The Modified National Institute of Standards and Technology database (MNIST), Canadian Institute for Advanced Research (CIFAR), and Street View House Numbers (SVHN) datasets are used, and the work is implemented using a pre-trained convolutional neural network (CNN) that is 22 layers deep. Supervised learning is used in the work, which classifies the particular dataset with the proper structure of the model. In still images, Googlenet is used to improve accuracy. The final layer of the Googlenet is replaced using transfer learning to improve its accuracy. At the same time, the accuracy in moving images can be maintained by transfer learning techniques. Hardware is the main backbone for any model to obtain faster results with a large number of datasets. Here, an Nvidia Jetson Nano is used, whose graphics processing unit (GPU) can handle a large number of computations in the process of object detection. Results show that the accuracy of objects detected by the transfer learning method is higher than that of existing methods.

preprint, 2021, arXiv

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed on benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, e.g., compared to a recently published work in this field, the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.
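One plausible reading of row-balanced, dual-ratio sparsification is sketched below: every row keeps the same number of largest-magnitude weights, and the two LSTM weight matrices are pruned with different ratios. The matrix names `w_x` and `w_h`, the ratios, and the magnitude-based selection are assumptions for illustration; the abstract does not specify the paper's exact criterion.

```python
def prune_row_balanced(matrix, sparsity):
    """Zero the smallest-magnitude weights so every row keeps the same count.

    `sparsity` is the fraction of weights removed from each row.
    """
    pruned = []
    for row in matrix:
        keep = len(row) - int(len(row) * sparsity)
        # Indices of the `keep` largest-magnitude weights in this row.
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        kept = set(order[:keep])
        pruned.append([w if j in kept else 0.0 for j, w in enumerate(row)])
    return pruned


weights = [[0.9, -0.1, 0.4, 0.05],
           [0.2, -0.8, 0.03, 0.6]]

# Dual ratio: the (hypothetical) input matrix is pruned less aggressively
# than the (hypothetical) recurrent matrix.
w_x = prune_row_balanced(weights, sparsity=0.5)
w_h = prune_row_balanced(weights, sparsity=0.75)
print(w_x)
print(w_h)
```

Because each row ends up with an identical number of non-zeros, every processing lane receives the same amount of work, which is the property that makes such a pattern friendly to a pipelined hardware datapath.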

preprint, 2021, arXiv

A Low Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

Massive data transfer between the processing and storage units has been the leading bottleneck in modern von Neumann computing systems, especially when used for Artificial Intelligence (AI) tasks. Computing-in-Memory (CIM) has shown great potential to reduce both latency and power consumption. However, conventional analog CIM schemes suffer from reliability issues, which may significantly degrade the accuracy of the computation. Recently, CIM schemes with digitized input data and weights have been proposed for highly reliable computing. However, the properties of the digital memory and input data are not fully utilized. This paper presents a novel low-power CIM scheme that further reduces power consumption by using a Modified Radix-4 (M-RD4) Booth algorithm at the input and a Modified Canonical Signed Digit (M-CSD) representation for the network weights. The simulation results show that M-RD4 and M-CSD reduce the ratio of $1\times1$ by 78.5% on LeNet and 80.2% on AlexNet, and improve the computing efficiency by 41.6% on average. The computing-power rate at the fixed-point 8-bit is 60.68 TOPS/s/W.
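The paper's M-RD4 and M-CSD are modified schemes whose details are not given in the abstract. For orientation, the sketch below shows the textbook versions they build on: radix-4 Booth recoding of a two's complement input and canonical signed-digit (non-adjacent form) encoding of a weight. Both raise the share of zero digits, which is what cuts the number of non-zero partial products.

```python
def booth_radix4(x: int, bits: int = 8):
    """Textbook radix-4 Booth digits (least significant first), each in -2..2."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    u = x & ((1 << bits) - 1)  # two's complement bit pattern

    def bit(i):
        return (u >> i) & 1 if i >= 0 else 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(bits // 2)]


def csd(x: int):
    """Canonical signed-digit digits of x (least significant first), in -1..1."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)  # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits


def evaluate(digits, base):
    """Reconstruct the integer from its digits in the given base."""
    return sum(d * base ** i for i, d in enumerate(digits))


print(booth_radix4(7))  # [-1, 2, 0, 0]  ->  -1 + 2*4 = 7
print(csd(7))           # [-1, 0, 0, 1]  ->  -1 + 8   = 7
```

In plain binary, 7 has three non-zero bits; in both recodings it has only two non-zero digits, and CSD additionally guarantees that no two adjacent digits are non-zero.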

preprint, 2021, arXiv

Neural Storage: A New Paradigm of Elastic Memory

Storage and retrieval of data in a computer memory play a major role in system performance. Traditionally, computer memory organization is static - i.e., it does not change based on the application-specific characteristics in memory access behaviour during system operation. Specifically, the association of a data block with a search pattern (or cues) as well as the granularity of stored data do not evolve. Such a static nature of computer memory, we observe, not only limits the amount of data we can store in a given physical storage, but it also misses the opportunity for dramatic performance improvement in various applications. On the contrary, human memory is characterized by seemingly infinite plasticity in storing and retrieving data - as well as dynamically creating/updating the associations between data and corresponding cues. In this paper, we introduce Neural Storage (NS), a brain-inspired learning memory paradigm that organizes the memory as a flexible neural memory network. In NS, the network structure, strength of associations, and granularity of the data adjust continuously during system operation, providing unprecedented plasticity and performance benefits. We present the associated storage/retrieval/retention algorithms in NS, which integrate a formalized learning process. Using a full-blown operational model, we demonstrate that NS achieves an order of magnitude improvement in memory access performance for two representative applications when compared to traditional content-based memory.

preprint, 2021, arXiv

High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh-based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance computing domain. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads and (2) the development and use of a predictive analytic model for exploring the design space, resource estimates and performance. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly-optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth and energy consumption. Performance results indicate equivalent runtime performance of the FPGA implementations to the V100 GPU, with over 2x energy savings, for the largest non-trivial application synthesized on the FPGA compared to the best performing GPU-based solution. Our investigation shows the considerable challenges in gaining high performance on current generation FPGAs compared to traditional architectures. We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.

preprint, 2021, arXiv

Exploring Fault-Energy Trade-offs in Approximate DNN Hardware Accelerators

Systolic array-based deep neural network (DNN) accelerators have recently gained prominence for their low computational cost. However, their high energy consumption poses a bottleneck to their deployment in energy-constrained devices. To address this problem, approximate computing can be employed at the cost of some tolerable accuracy loss. However, such small accuracy variations may increase the sensitivity of DNNs towards undesired subtle disturbances, such as permanent faults. The impact of permanent faults in accurate DNNs has been thoroughly investigated in the literature. Conversely, the impact of permanent faults in approximate DNN accelerators (AxDNNs) is still under-explored. The impact of such faults may vary with the fault bit positions, activation functions and approximation errors in AxDNN layers. Such dynamicity poses a considerable challenge to exploring the trade-off between energy efficiency and fault resilience in AxDNNs. Towards this, we present an extensive layer-wise and bit-wise fault resilience and energy analysis of different AxDNNs, using the state-of-the-art Evoapprox8b signed multipliers. In particular, we vary the stuck-at-0 and stuck-at-1 fault-bit positions and the activation functions to study their impact using the widely used MNIST and Fashion-MNIST datasets. Our quantitative analysis shows that permanent faults exacerbate the accuracy loss in AxDNNs when compared to accurate DNN accelerators. For instance, a permanent fault in AxDNNs can lead to up to 66% accuracy loss, whereas the same faulty bit can lead to only 9% accuracy loss in an accurate DNN accelerator. Our results demonstrate that the fault resilience in AxDNNs is orthogonal to the energy efficiency.
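A stuck-at fault forces one bit of a signal to a constant regardless of the value computed. The sketch below models this on an 8-bit signed value; the helper name `stuck_at` and the choice of applying the fault to a single operand are assumptions for illustration, since the abstract does not say where in the datapath the faults are injected.

```python
def stuck_at(value: int, bit: int, stuck: int, width: int = 8) -> int:
    """Force one bit of a `width`-bit signed value to 0 or 1.

    Returns the faulty value, interpreted again as two's complement.
    """
    mask = (1 << width) - 1
    raw = value & mask
    if stuck:
        raw |= 1 << bit            # stuck-at-1
    else:
        raw &= ~(1 << bit) & mask  # stuck-at-0
    # Reinterpret the bit pattern as a signed integer.
    return raw - (1 << width) if raw >> (width - 1) else raw


for bit in (0, 3, 7):
    print(bit, stuck_at(5, bit, 1), stuck_at(5, bit, 0))
```

For the value 5, a stuck-at-1 fault in bit 3 yields 13, while the same fault in the sign bit yields -123, which illustrates why the abstract treats the fault bit position as a key variable.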

preprint 2021 arXiv

Symbolic Loop Compilation for Tightly Coupled Processor Arrays

Loop compilation for Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, entails solving NP-hard problems, yet depends on the loop bounds and number of available processing elements (PEs), parameters known only at runtime because of dynamic resource management and input sizes. Therefore, this article proposes a two-phase approach called symbolic loop compilation: At compile time, the necessary NP-complete problems are solved and the solutions compiled into a space-efficient symbolic configuration. At runtime, a concrete configuration is generated from the symbolic configuration according to the parameter values. We show that the latter phase, called instantiation, runs in polynomial time with its most complex step, program instantiation, not depending on the number of PEs. As validation, we performed symbolic loop compilation on real-world loops and measured time and space requirements. Our experiments confirm that a symbolic configuration is space-efficient and suited for systems with little memory -- often, a symbolic configuration is smaller than a single concrete configuration -- and that program instantiation scales well with the number of PEs -- for example, when instantiating a symbolic configuration of a matrix-matrix multiplication, the execution time is similar for $4\times 4$ and $32\times 32$ PEs.

preprint 2021 arXiv

EXMA: A Genomics Accelerator for Exact-Matching

Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts Ferragina-Manzini Index (FM-Index) augmenting space-efficient Burrows-Wheeler transform (BWT) with additional data structures to permit ultra-fast exact-match operations. However, FM-Index is notorious for its poor spatial locality and random memory access pattern. Prior works create GPU-, FPGA-, ASIC- and even process-in-memory (PIM)-based accelerators to boost FM-Index search throughput. Though they achieve the state-of-the-art FM-Index search throughput, the same as all prior conventional accelerators, FM-Index PIMs process only one DNA symbol after each DRAM row activation, thereby suffering from poor memory bandwidth utilization. In this paper, we propose a hardware accelerator, EXMA, to enhance FM-Index search throughput. We first create a novel EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation. We then build an accelerator to search over an EXMA table. We propose 2-stage scheduling to increase the cache hit rate of our accelerator. We introduce dynamic page policy to improve the row buffer hit rate of DRAM main memory. We also present CHAIN compression to reduce the data structure size of EXMA tables. Compared to state-of-the-art FM-Index PIMs, EXMA improves search throughput by $4.9\times$, and enhances search throughput per Watt by $4.8\times$.
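For readers unfamiliar with FM-Index exact matching, the following is a minimal software sketch of the textbook backward-search algorithm. It is not EXMA or any accelerator design; the function names `build_fm_index` and `count_matches`, the naive suffix sorting, and the dense occurrence table are assumptions chosen for brevity.

```python
def build_fm_index(text: str):
    """Build a tiny FM-Index (illustrative; naive construction)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for short strings only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))

    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_matches(pattern: str, index) -> int:
    """Count exact occurrences of pattern via backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):  # one symbol per step, right to left
        if ch not in first:
            return 0
        # Two table lookups at unrelated positions (lo and hi) per symbol.
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGA")
print(count_matches("ACG", index))
```

Each loop iteration consumes a single symbol and reads the occurrence table at two data-dependent positions, which is the access pattern the abstract describes as having poor spatial locality.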

preprint 2021 arXiv

Towards Creating a Deployable Grasp Type Probability Estimator for a Prosthetic Hand

For lower arm amputees, prosthetic hands promise to restore most physical interaction capabilities. This requires accurately predicting hand gestures capable of grabbing varying objects and executing them in a timely manner as intended by the user. Current approaches often rely on physiological signal inputs such as Electromyography (EMG) signals from residual limb muscles to infer the intended motion. However, limited signal quality, user diversity and high variability adversely affect system robustness. Instead of relying solely on EMG signals, our work enables augmenting EMG intent inference with physical state probability through machine learning and computer vision methods. To this end, we: (1) study state-of-the-art deep neural network architectures to select a performant source of knowledge transfer for the prosthetic hand, (2) use a dataset containing object images and probability distributions of grasp types as a new form of labeling where, instead of using absolute values of zero and one as in conventional classification labels, our labels are a set of probabilities whose sum is 1. The proposed method generates probabilistic predictions which could be fused with EMG prediction of probabilities over grasps by using the visual information from the palm camera of a prosthetic hand. Our results demonstrate that InceptionV3 achieves the highest accuracy with 0.95 angular similarity, followed by 1.4 MobileNetV2 with 0.93 at ~20% the amount of operations.
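The probabilistic labels and the angular-similarity score mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the definition below (1 minus the angle between the vectors, normalised by 90 degrees, which suits non-negative vectors) is an assumption, as are the function name `angular_similarity` and the example grasp distributions.

```python
import math


def angular_similarity(p, q):
    """Angular similarity of two non-negative vectors, in [0, 1].

    Assumed definition: 1 - (2 / pi) * arccos(cosine similarity).
    The paper may use a different normalisation.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        raise ValueError("vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    return 1.0 - (2.0 / math.pi) * math.acos(cosine)


# Hypothetical label: a probability distribution over three grasp types
# (sums to 1) rather than a one-hot class label.
label = [0.7, 0.2, 0.1]
prediction = [0.6, 0.3, 0.1]
print(round(angular_similarity(label, prediction), 3))
```

A score of 1 means the predicted distribution points in the same direction as the label, and 0 means the two distributions share no grasp type.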

People in this topic

12 visible researcher(s)