Topic overview

Hardware Architecture

1095 works, 3776 researchers

preprint 2016 arXiv

Memory Efficient Multi-Scale Line Detector Architecture for Retinal Blood Vessel Segmentation

This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing the computations and optimizing the bit-width. The throughput is increased by designing fully pipelined functional units. The architecture is capable of achieving a comparable accuracy to its software implementation but 70x faster for low resolution images. For high resolution images, it achieves an acceleration by a factor of 323x.

preprint 2016 arXiv

Application-aware Retiming of Accelerators: A High-level Data-driven Approach

Flexibility at hardware level is the main driving force behind adaptive systems whose aim is to realise microarchitecture deconfiguration 'online'. This feature allows the software/hardware stack to tolerate drastic changes of the workload in data centres. With the emergence of FPGA reconfigurability, this technology is becoming a mainstream computing paradigm. Adaptivity is usually accompanied by the high-level tools to facilitate multi-dimensional space exploration. An essential aspect in this space is memory orchestration where on-chip and off-chip memory distribution significantly influences the architecture in coping with the critical spatial and timing constraints, e.g. Place and Route. This paper proposes a memory smart technique for a particular class of adaptive systems: Elastic Circuits which enjoy slack elasticity at fine level of granularity. We explore retiming of a set of popular benchmarks via investigating the memory distribution within and among accelerators. The area, performance and power patterns are adopted by our high-level synthesis framework, with respect to the behaviour of the input descriptions, to improve the quality of the synthesised elastic circuits.

preprint 2016 arXiv

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.
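The reduction of weights and activations to binary values can be illustrated with a small sketch. It assumes the common encoding in which +1 maps to bit 1 and -1 to bit 0, so that a dot product of two ±1 vectors reduces to an XNOR followed by a population count. The function names `pack_bits` and `binary_dot` are hypothetical, and this is not FINN's code or necessarily its exact hardware mapping.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int (bit i is set when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors given as packed bit words."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # population count
    return 2 * matches - n              # matches minus mismatches


# Illustrative vectors (assumed data, not from the paper)
a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))  # ordinary integer dot product
```

In this sketch `result` and `reference` agree, which is the property that lets multipliers be replaced by bitwise logic in binarized layers.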

preprint 2016 arXiv

Copycat: A High Precision Real Time NAND Simulator

In this paper, we describe the design and implementation of a high precision real time NAND simulator called Copycat that runs on a commodity multi-core desktop environment. This NAND simulator facilitates the development of embedded flash memory management software such as the flash translation layer (FTL). The simulator also allows a comprehensive fault injection for testing the reliability of the FTL. Compared against a real FPGA implementation, the simulator's response time deviation is under 0.28% on average, with a maximum of 10.12%.

preprint 2016 arXiv

NOP - A Simple Experimental Processor for Parallel Deployment

The design of a parallel computing system using several thousands or even up to a million processors asks for processing units that are simple and thus small in space, to make as many processing units as possible fit on a single die. The design presented herewith is far from being optimised; it is not meant to compete with industry performance devices. Its main purpose is to allow for a prototypical implementation of a dynamic software system as a proof of concept.

preprint 2016 arXiv

Prototyping RISC Based, Reconfigurable Networking Applications in Open Source

In the last decade we have witnessed a rapid growth in data center systems, requiring new and highly complex networking devices. The need to refresh networking infrastructure whenever new protocols or functions are introduced, and the increasing costs that this entails, are of a concern to all data center providers. New generations of Systems on Chip (SoC), integrating microprocessors and higher bandwidth interfaces, are an emerging solution to this problem. These devices permit entirely new systems and architectures that can obviate the replacement of existing networking devices while enabling seamless functionality change. In this work, we explore open source, RISC based, SoC architectures with high performance networking capabilities. The prototype architectures are implemented on the NetFPGA-SUME platform. Beyond details of the architecture, we also describe the hardware implementation and the porting of operating systems to the platform. The platform can be exploited for the development of practical networking appliances, and we provide use case examples.

preprint 2016 arXiv

A Novel RTL ATPG Model Based on Gate Inherent Faults (GIF-PO) of Complex Gates

This paper starts with a comprehensive survey on RTL ATPG. It then proposes a novel RTL ATPG model based on "Gate Inherent Faults" (GIF). These GIF are extracted from each complex gate (adder, case-statement, etc.) of the RTL source code individually. They are related to the internal logic paths of a complex gate. They are not related to any net/signal in the RTL design. It is observed that when all GIF on RTL are covered (100%) and the same stimulus is applied, all gate level stuck-at faults of the netlist are covered (100%) as well. The proposed RTL ATPG model is therefore synthesis independent. This is shown on ITC'99 testcases. The applied semi-automatic test pattern generation process is based on functional simulation.

preprint 2016 arXiv

HADES: Microprocessor Hazard Analysis via Formal Verification of Parameterized Systems

HADES is a fully automated verification tool for pipeline-based microprocessors that aims at flaws caused by improperly handled data hazards. It focuses on single-pipeline microprocessors designed at the register transfer level (RTL) and deals with read-after-write, write-after-write, and write-after-read hazards. HADES combines several techniques, including data-flow analysis, error pattern matching, SMT solving, and abstract regular model checking. It has been successfully tested on several microprocessors for embedded applications.
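The three hazard classes named in the abstract can be stated compactly in code. The sketch below only applies the textbook definitions to a pair of instructions in program order; it is not HADES's verification method, and the instruction representation (dictionaries with `reads` and `writes` register sets) is an assumption made for illustration.

```python
def classify_hazards(first, second):
    """Return the hazard types that arise when `second` follows `first`.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")   # read-after-write: second reads what first writes
    if first["writes"] & second["writes"]:
        hazards.add("WAW")   # write-after-write: both write the same register
    if first["reads"] & second["writes"]:
        hazards.add("WAR")   # write-after-read: second overwrites what first reads
    return hazards


# Hypothetical two-instruction sequence
add_i = {"reads": {"r2", "r3"}, "writes": {"r1"}}   # r1 = r2 + r3
sub_i = {"reads": {"r1", "r4"}, "writes": {"r2"}}   # r2 = r1 - r4
found = classify_hazards(add_i, sub_i)
```

Here `found` contains a RAW hazard on `r1` and a WAR hazard on `r2`. Whether a given pipeline handles such pairs correctly is the question a tool like HADES addresses.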

preprint 2016 arXiv

A 700uW 1GS/s 4-bit Folding-Flash ADC in 65nm CMOS for Wideband Wireless Communications

We present the design of a low-power 4-bit 1GS/s folding-flash ADC with a folding factor of two. The design of a new unbalanced double-tail dynamic comparator affords an ultra-low power operation and a high dynamic range. Unlike the conventional approaches, this design uses a fully matched input stage, an unbalanced latch stage, and a two-clock operation scheme. A combination of these features yields significant reduction of the kick-back noise, while allowing the design flexibility for adjusting the trip points of the comparators. As a result, the ADC achieves SNDR of 22.3 dB at 100MHz and 21.8 dB at 500MHz (i.e. the Nyquist frequency). The maximum INL and DNL are about 0.2 LSB. The converter consumes about 700uW from a 1-V supply yielding a figure of merit of 65fJ/conversion step. These attributes make the proposed folding-flash ADC attractive for the next-generation wireless applications.
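The quoted figure of merit can be sanity-checked from the other numbers in the abstract. The sketch below assumes the widely used Walden definition, FoM = P / (2^ENOB · f_s) with ENOB = (SNDR − 1.76) / 6.02; the abstract does not state which definition the authors used, and the power is given only as "about 700uW", so the result is approximate.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w, sample_rate_hz, sndr_db):
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2 ** enob(sndr_db) * sample_rate_hz)


# Values taken from the abstract: ~700 uW, 1 GS/s, SNDR at 100 MHz and 500 MHz
fom_100mhz = walden_fom(700e-6, 1e9, 22.3)
fom_nyquist = walden_fom(700e-6, 1e9, 21.8)
```

Under these assumptions the 100 MHz SNDR gives a value in the mid-60s of fJ per conversion step, broadly consistent with the quoted 65 fJ, while the Nyquist SNDR gives a somewhat higher value.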

preprint 2016 arXiv

An Artificial Neural Networks based Temperature Prediction Framework for Network-on-Chip based Multicore Platform

Continuous improvement in silicon process technologies has made possible the integration of hundreds of cores on a single chip. However, power and heat have become dominant constraints in designing these massive multicore chips, causing issues with reliability, timing variations and reduced lifetime of the chips. Dynamic Thermal Management (DTM) is a solution to avoid high temperatures on the die. Typical DTM schemes only address core level thermal issues. However, the Network-on-chip (NoC) paradigm, which has emerged as an enabling methodology for integrating hundreds to thousands of cores on the same die, can contribute significantly to the thermal issues. Moreover, the typical DTM is triggered reactively based on temperature measurements from on-chip thermal sensors, requiring long reaction times, whereas a predictive DTM method estimates future temperature in advance, eliminating the chance of temperature overshoot. Artificial Neural Networks (ANNs) have been used in various domains for modeling and prediction with high accuracy due to their ability to learn and adapt. This thesis concentrates on designing an ANN prediction engine to predict the thermal profile of the cores and Network-

preprint 2016 arXiv

Arch2030: A Vision of Computer Architecture Research over the Next 15 Years

Application trends, device technologies and the architecture of systems drive progress in information technologies. However, the former engines of such progress - Moore's Law and Dennard Scaling - are rapidly reaching the point of diminishing returns. The time has come for the computing community to boldly confront a new challenge: how to secure a foundational future for information technology's continued progress. The computer architecture community engaged in several visioning exercises over the years. Five years ago, we released a white paper, 21st Century Computer Architecture, which influenced funding programs in both academia and industry. More recently, the IEEE Rebooting Computing Initiative explored the future of computing systems in the architecture, device, and circuit domains. This report stems from an effort to continue this dialogue, reach out to the applications and devices/circuits communities, and understand their trends and vision. We aim to identify opportunities where architecture research can bridge the gap between the application and device domains.

preprint 2016 arXiv

Novel Graph Processor Architecture, Prototype System, and Results

Graph algorithms are increasingly used in applications that exploit large databases. However, conventional processor architectures are inadequate for handling the throughput and memory requirements of graph computation. Lincoln Laboratory's graph-processor architecture represents a rethinking of parallel architectures for graph problems. Our processor utilizes innovations that include a sparse matrix-based graph instruction set, a cacheless memory system, accelerator-based architecture, a systolic sorter, high-bandwidth multi-dimensional toroidal communication network, and randomized communications. A field-programmable gate array (FPGA) prototype of the new graph processor has been developed with significant performance enhancement over conventional processors in graph computational throughput.

preprint 2016 arXiv

Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly-used bitwise operations, Buddy provide

preprint 2016 arXiv

FPGA Based Implementation of Distributed Minority and Majority Voting Based Redundancy for Mission and Safety-Critical Applications

Electronic circuits and systems used in mission and safety-critical applications usually employ redundancy in the design to overcome arbitrary fault(s) or failure(s) and guarantee the correct operation. In this context, the distributed minority and majority voting based redundancy (DMMR) scheme forms an efficient alternative to the conventional N-modular redundancy (NMR) scheme for implementing mission and safety-critical circuits and systems by significantly minimizing their weight and design cost and also their design metrics whilst providing a similar degree of fault tolerance. This article presents the first FPGA-based implementation of example DMMR circuits and compares it with counterpart NMR circuits on the basis of area occupancy and critical path delay viz. area-delay product (ADP). The example DMMR circuits and counterpart NMR circuits are able to accommodate the faulty or failure states of 2, 3 and 4 function modules. For physical synthesis, two commercial Xilinx FPGAs viz. Spartan 3E and Virtex 5 corresponding to 90nm and 65nm CMOS processes, and two radiation-tolerant and military grade Xilinx FPGAs viz. QPro Virtex 2 and QPro Virtex E corresponding to 150nm and 180nm

preprint 2016 arXiv

Memory Controller Design Under Cloud Workloads

This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate: 1) Several recently proposed memory scheduling policies are not a good match for these scale-out workloads. 2) The relatively simple First-Ready-First-Come-First-Served (FR-FCFS) policy performs consistently better, and 3) for most of the studied workloads, the even simpler First-Come-First-Served scheduling policy is within 1% of FR-FCFS. 4) Increasing the number of memory channels offers negligible performance benefits, e.g., performance improves by 1.7% on average for 4-channels vs. 1-channel. 5) 77%-90% of DRAM row activations are accessed only once before closure. These observations can guide future development and optimization of memory controllers for scale-out workloads.
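The two scheduling policies compared in the abstract differ only in how the next request is picked. The following is a minimal sketch, assuming a single request queue and a map of the currently open row per bank; the `Request` type, the function names and the example queue are hypothetical and do not come from the paper.

```python
from collections import namedtuple

# arrival: arrival order; bank/row: the DRAM location the request targets
Request = namedtuple("Request", ["arrival", "bank", "row"])


def pick_fcfs(queue, open_rows):
    """First-Come-First-Served: always the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def pick_fr_fcfs(queue, open_rows):
    """First-Ready FCFS: oldest row-buffer hit if any, otherwise the oldest request."""
    ready = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = ready if ready else queue
    return min(candidates, key=lambda r: r.arrival)


# Illustrative state: bank 0 has row 7 open, bank 1 has row 9 open
queue = [Request(0, 0, 5), Request(1, 0, 7), Request(2, 1, 3)]
open_rows = {0: 7, 1: 9}
fcfs_choice = pick_fcfs(queue, open_rows)
fr_fcfs_choice = pick_fr_fcfs(queue, open_rows)
```

In this example FCFS serves the oldest request even though it needs a new row activation, while FR-FCFS serves the slightly younger request that hits the open row. When most activated rows are accessed only once, as the abstract reports, such hits are rare and the two policies tend to choose the same request.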

preprint 2016 arXiv

Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the theoretical peak performance at 65W to 240W respectively for compute bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth bound operations like Single/Double precision Matrix-vector Multiplication (XGEMV) the performance is merely 5 to 7% of the theoretical peak performance in multicores and GPGPUs respectively. Achieving performance in BLAS requires moving away from conventional wisdom and evolving towards customized accelerators tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm architecture co-design on a Coarse-grained Reconfigurable Architecture (CGRA). We choose REDEFINE CGRA as a platform for our experiments since REDEFINE can be adapted to support domain of interest through tailor-made Custom Function Units

preprint 2014 arXiv

Multi Core SSL/TLS Security Processor Architecture Prototype Design with automated Preferential Algorithm in FPGA

In this paper a pipelined architecture of a high speed network security processor (NSP) for the SSL/TLS protocol is implemented on a system on chip (SOC) where hardware information of all encryption, hashing and key exchange algorithms is stored in flash memory in terms of bit files, in contrast to related works where all are actually implemented in hardware. The NSP finds applications in e-commerce, virtual private network (VPN) and in other fields that require data confidentiality. The motivation of the present work is to dynamically execute applications with stipulated throughput within budgeted hardware resource and power. A preferential algorithm choosing an appropriate cipher suite is proposed, which is based on an Efficient System Index (ESI) budget comprising power, throughput and resource given by the user. The bit files of the chosen security algorithms are downloaded from the flash memory to the partial region of field programmable gate array (FPGA). The proposed SOC controls data communication between an application running in a system through a PCI and the Ethernet interface of a network. Partial configuration feature is used in ISE14.4 suite with ZYNQ 7z020-clg484 FPGA p

preprint 2016 arXiv

Can Broken Multicore Hardware be Mended?

A suggestion is made for mending multicore hardware, which has been diagnosed as broken.

preprint 2016 arXiv

High-performance K-means Implementation based on a Simplified Map-Reduce Architecture

The k-means algorithm is one of the most common clustering algorithms and widely used in data mining and pattern recognition. The increasing computational requirement of big data applications makes hardware acceleration for the k-means algorithm necessary. In this paper, a simplified Map-Reduce architecture is proposed to implement the k-means algorithm on an FPGA. Algorithmic segmentation, data path elaboration and automatic control are applied to optimize the architecture for high performance. In addition, high level synthesis technique is utilized to reduce development cycles and complexity. For a single iteration in the k-means algorithm, a throughput of 28.74 Gbps is achieved. The performance shows at least 3.93x speedup compared with four representative existing FPGA-based implementations and can satisfy the demand of big data applications.
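One k-means iteration splits naturally into a map phase and a reduce phase, which is the structure the abstract refers to. The sketch below is a plain software illustration under that assumption, with hypothetical function names and data; it says nothing about the paper's FPGA data path or its measured throughput.

```python
def map_step(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    pairs = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        pairs.append((dists.index(min(dists)), p))
    return pairs


def reduce_step(pairs, centroids):
    """Reduce: average the points assigned to each centroid.

    A centroid with no assigned points keeps its previous position.
    """
    sums = [[0.0] * len(c) for c in centroids]
    counts = [0] * len(centroids)
    for idx, p in pairs:
        counts[idx] += 1
        for d, v in enumerate(p):
            sums[idx][d] += v
    return [
        tuple(s / counts[i] for s in sums[i]) if counts[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]


# Illustrative 2-D data with two initial centroids
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = reduce_step(map_step(points, centroids), centroids)
```

The map phase is independent per point, which is what makes it attractive for parallel hardware; the reduce phase is an accumulation per cluster.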

preprint 2010 arXiv

Memristor Crossbar-based Hardware Implementation of IDS Method

Ink Drop Spread (IDS) is the engine of the Active Learning Method (ALM), which is a methodology of soft computing. IDS, as a pattern-based processing unit, extracts useful information from a system subjected to modeling. In spite of its excellent potential in solving problems such as classification and modeling compared to other soft computing tools, finding a simple and fast hardware implementation of it is still a challenge. This paper describes a new hardware implementation of the IDS method based on the memristor crossbar structure. In addition to simplicity, being completely real-time, having low latency and the ability to continue working after the occurrence of power breakdown are some of the advantages of our proposed circuit.

preprint 2013 arXiv

A Low-Voltage, Low-Power 4-bit BCD Adder, designed using the Clock Gated Power Gating, and the DVT Scheme

This paper proposes a Low-Power, Energy Efficient 4-bit Binary Coded Decimal (BCD) adder design where the conventional 4-bit BCD adder has been modified with the Clock Gated Power Gating Technique. Moreover, the concept of DVT (Dual-vth) scheme has been introduced while designing the full adder blocks to reduce the Leakage Power, as well as, to maintain the overall performance of the entire circuit. The reported architecture of 4-bit BCD adder is designed using 45 nm technology and it consumes 1.384 μWatt of Average Power while operating with a frequency of 200 MHz, and a Supply Voltage (Vdd) of 1 Volt. The results obtained from different simulation runs on SPICE, indicate the superiority of the proposed design compared to the conventional 4-bit BCD adder. Considering the product of Average Power and Delay, for the operating frequency of 200 MHz, a fair 47.41 % reduction compared to the conventional design has been achieved with this proposed scheme.
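For readers unfamiliar with the conventional 4-bit BCD adder that the paper modifies, its arithmetic can be sketched as a binary add followed by a +6 correction whenever the intermediate sum exceeds 9. This models only the logical behaviour, assuming the standard two-stage structure; it does not model the clock gating, power gating or dual-Vth techniques, and the function name is hypothetical.

```python
def bcd_add_digit(a, b, carry_in=0):
    """Add two BCD digits (0-9) plus a carry; return (sum_digit, carry_out)."""
    if not (0 <= a <= 9 and 0 <= b <= 9 and carry_in in (0, 1)):
        raise ValueError("inputs must be valid BCD digits and a 0/1 carry")
    total = a + b + carry_in      # first stage: plain binary addition
    if total > 9:                 # result is not a valid BCD digit
        total += 6                # second stage: add 0110 to skip the six unused codes
        return total & 0xF, 1     # keep the low 4 bits, raise the decimal carry
    return total, 0


digit, carry = bcd_add_digit(7, 8)   # 7 + 8 = 15 in decimal
```

In this example the result is digit 5 with a carry of 1, i.e. decimal 15.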

preprint 2011 arXiv

Brain-like infrastructure for embedded SoC diagnosis

This article describes a high-speed multiprocessor architecture for concurrently analyzing information represented in analytic, graph and table forms of associative relations to search, recognize and make a decision in n-dimensional vector discrete space. Vector-logical process models of actual applications, for which the quality of solution is estimated by the proposed integral non-arithmetical metric of the interaction between Boolean vectors, are described.

preprint 2009 arXiv

Turbo NOC: a framework for the design of Network On Chip based turbo decoder architectures

This work proposes a general framework for the design and simulation of network on chip based turbo decoder architectures. Several parameters in the design space are investigated, namely the network topology, the parallelism degree, the rate at which messages are sent by processing nodes over the network and the routing strategy. The main results of this analysis are: i) the most suited topologies to achieve high throughput with a limited complexity overhead are generalized de-Bruijn and generalized Kautz topologies; ii) depending on the throughput requirements different parallelism degrees, message injection rates and routing algorithms can be used to minimize the network area overhead.

preprint 2016 arXiv

Fast and reconfigurable packet classification engine in FPGA-based firewall

In data communication via the internet, security is becoming one of the most influential aspects. One way to support it is by classifying and filtering Ethernet packets within network devices. Packet classification is a fundamental task for network devices such as routers, firewalls, and intrusion detection systems. In this paper we present the architecture of a fast and reconfigurable Packet Classification Engine (PCE). This engine is used in an FPGA-based firewall. Our PCE inspects the multi-dimensional fields of the packet header sequentially using a tree-based algorithm. This algorithm simplifies the overall system to a lower scale and leads to a more secure system. The PCE works with an adaptation of a single cycle processor architecture in the system. Each Ethernet packet is examined by the PCE based on the Source IP Address, Destination IP Address, Source Port, Destination Port, and Protocol fields of the packet header. These are the basic fields used to determine whether a packet is dangerous or normal before inspecting its content. Using an implementation of the tree-based algorithm in the architecture, firewall rules are rebuilt into 24-bit sub-rules which are read as processor instructions in the inspection process. The inspec

447 works