Source author record

Marian Verhelst

Marian Verhelst appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Computer Vision Hardware Architecture Machine Learning Computation and Language Information Theory math.IT math.NA Numerical Analysis Performance

Catalog footprint

What is connected

12works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

Multi-head self-attention forms the core of Transformer networks. However, their quadratically growing complexity with respect to the input sequence length impedes their deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between the subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.

preprint2022arXiv

GRAPHOPT: constrained-optimization-based parallelization of irregular graphs

Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern platforms remains challenging due to the irregular data dependencies. The execution performance can be improved by efficiently partitioning the graphs such that the communication and thread synchronization overheads are minimized without hurting the utilization of the threads. To achieve this, this paper proposes GRAPHOPT, a tool that models the graph parallelization as a constrained optimization problem and uses the open Google OR-Tools solver to find good partitions. Several scalability techniques are developed to handle large real-world graphs with millions of nodes and edges. Extensive experiments are performed on the graphs of sparse matrix triangular solves (linear algebra) and sum-product networks (machine learning), respectively, showing a mean speedup of 2.0X and 1.8X over previous state-of-the-art libraries, demonstrating the effectiveness of the constrained-optimization-based graph parallelization.

preprint2022arXiv

Hardware-aware mobile building block evaluation for computer vision

In this work we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach allows to match the information obtained by previous comparison paradigms, but provides more insights in the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of 2x on specific hardware ML accelerators.

preprint2022arXiv

Taxonomy and Benchmarking of Precision-Scalable MAC Arrays Under Enhanced DNN Dataflow Representation

Reduced-precision and variable-precision multiply-accumulate (MAC) operations provide opportunities to significantly improve energy efficiency and throughput of DNN accelerators with no/limited algorithmic performance loss, paving a way towards deploying AI applications on resource-constraint edge devices. Accordingly, various precision-scalable MAC array (PSMA) architectures were proposed recently. However, it is difficult to make a fair comparison between those alternatives, as each proposed PSMA is demonstrated in different systems and technologies. This work aims to provide a clear view of the design space of PSMA and offer insights for selecting the optimal architectures based on designers' needs. First, we introduce a precision-enhanced for-loop representation for DNN dataflows. Next, we use this new representation towards a comprehensive PSMA taxonomy, capable of systematically covering most prominent state-of-the-art PSMAs, as well as uncovering new PSMA architectures. Following that, we build a highly parameterized PSMA template that can be design-time configured into a huge subset of the design space spanned by the taxonomy. This allows to fairly and thoroughly benchmark 72 different PSMA architectures. We perform such studies in 28nm technology targeting run-time precision scalability from 8 to 2 bits, operating at 200 MHz and 1 GHz. Analyzing resulting energy and area breakdowns reveals key design guidelines for PSMA architectures.

preprint2021arXiv

Acceleration of probabilistic reasoning through custom processor architecture

Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has an optimized datapath architecture and memory hierarchy optimized for sum-product networks execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.

preprint2021arXiv

Benchmarking TinyML Systems: Challenges and Direction

Recent advancements in ultra-low-power machine learning (TinyML) hardware promises to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted benchmark for these systems. Benchmarking allows us to measure and thereby systematically compare, evaluate, and improve the performance of systems and is therefore fundamental to a field reaching maturity. In this position paper, we present the current landscape of TinyML and discuss the challenges and direction towards developing a fair and useful hardware benchmark for TinyML workloads. Furthermore, we present our four benchmarks and discuss our selection methodology. Our viewpoints reflect the collective thoughts of the TinyMLPerf working group that is comprised of over 30 organizations.

preprint2021arXiv

ProbLP: A framework for low-precision probabilistic inference

Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.

preprint2020arXiv

Feed-Forward On-Edge Fine-tuning Using Static Synthetic Gradient Modules

Training deep learning models on embedded devices is typically avoided since this requires more memory, computation and power over inference. In this work, we focus on lowering the amount of memory needed for storing all activations, which are required during the backward pass to compute the gradients. Instead, during the forward pass, static Synthetic Gradient Modules (SGMs) predict gradients for each layer. This allows training the model in a feed-forward manner without having to store all activations. We tested our method on a robot grasping scenario where a robot needs to learn to grasp new objects given only a single demonstration. By first training the SGMs in a meta-learning manner on a set of common objects, during fine-tuning, the SGMs provided the model with accurate gradients to successfully learn to grasp new objects. We have shown that our method has comparable results to using standard backpropagation.

preprint2020arXiv

ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework

Building efficient embedded deep learning systems requires a tight co-design between DNN algorithms, memory hierarchy, and dataflow. However, owing to the large degrees of freedom in the design space, finding an optimal solution through the implementation of individual design points becomes infeasible. Recently, several estimation frameworks for fast design space exploration (DSE) have emerged, yet they either suffer from long runtimes or a limited exploration space. This work introduces ZigZag, a memory-centric rapid DNN accelerator DSE framework which extends the DSE with uneven mapping opportunities, in which operands at shared memory levels are no longer bound to use the same memory levels for each loop index. For this, ZigZag uses a memory-centric nested-for-loop format as a uniform representation to integrate algorithm, accelerator, and algorithm-to-accelerator mapping, and consists of three key components: 1) a latency-enhanced analytical Hardware Cost Estimator, 2) a Temporal Mapping Generator that supports even/uneven scheduling on any type of memory hierarchy, and 3) an Architecture Generator that explores the whole memory hierarchy design space. Benchmarking experiments against existing frameworks, together with three case studies at different design abstraction levels show the strength of ZigZag. Up to 33% more energy-efficient solutions are found by introducing ZigZag's uneven scheduling opportunities.

preprint2016arXiv

A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets

A low-power precision-scalable processor for ConvNets or convolutional neural networks (CNN) is implemented in a 40nm technology. Its 256 parallel processing units achieve a peak 102GOPS running at 204MHz. To minimize energy consumption while maintaining throughput, this works is the first to both exploit the sparsity of convolutions and to implement dynamic precision-scalability enabling supply- and energy scaling. The processor is fully C-programmable, consumes 25-288mW at 204 MHz and scales efficiency from 0.3-2.6 real TOPS/W. This system hereby outperforms the state-of-the-art up to 3.9x in energy efficiency.

preprint2016arXiv

Energy-Efficient ConvNets Through Approximate Computing

Recently ConvNets or convolutional neural networks (CNN) have come up as state-of-the-art classification and detection algorithms, achieving near-human performance in visual detection. However, ConvNet algorithms are typically very computation and memory intensive. In order to be able to embed ConvNet-based classification into wearable platforms and embedded systems such as smartphones or ubiquitous electronics for the internet-of-things, their energy consumption should be reduced drastically. This paper proposes methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators. By combining techniques both at the system- and circuit level, we can gain energy in the systems arithmetic: up to 30x without losing classification accuracy and more than 100x at 99% classification accuracy, compared to the commonly used 16-bit fixed point number format.

preprint2015arXiv

Understanding interdependency through complex information sharing

The interactions between three or more random variables are often nontrivial, poorly understood, and yet, are paramount for future advances in fields such as network information theory, neuroscience, genetics and many others. In this work, we propose to analyze these interactions as different modes of information sharing. Towards this end, we introduce a novel axiomatic framework for decomposing the joint entropy, which characterizes the various ways in which random variables can share information. The key contribution of our framework is to distinguish between interdependencies where the information is shared redundantly, and synergistic interdependencies where the sharing structure exists in the whole but not between the parts. We show that our axioms determine unique formulas for all the terms of the proposed decomposition for a number of cases of interest. Moreover, we show how these results can be applied to several network information theory problems, providing a more intuitive understanding of their fundamental limits.

Marian Verhelst

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

GRAPHOPT: constrained-optimization-based parallelization of irregular graphs

Hardware-aware mobile building block evaluation for computer vision

Taxonomy and Benchmarking of Precision-Scalable MAC Arrays Under Enhanced DNN Dataflow Representation

Acceleration of probabilistic reasoning through custom processor architecture

Benchmarking TinyML Systems: Challenges and Direction

ProbLP: A framework for low-precision probabilistic inference

Feed-Forward On-Edge Fine-tuning Using Static Synthetic Gradient Modules

ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework

A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets

Energy-Efficient ConvNets Through Approximate Computing

Understanding interdependency through complex information sharing