Researcher profile

Marian Verhelst

Marian Verhelst contributes to research discovery and scholarly infrastructure.

Researcher, affiliation not imported, open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
8 topics
4 close collaborators

Published work

9 published items

Preprint, 2022, arXiv

Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

Multi-head self-attention forms the core of Transformer networks. However, their quadratically growing complexity with respect to the input sequence length impedes their deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between the subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.
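The threshold-based idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the running-reference update rule, and the toy data are assumptions made for illustration. Each token is compared with a reference vector, and only differences at or above the threshold are kept; zeroed entries stand in for multiply-accumulates that could be skipped.

```python
def delta_prune(tokens, threshold):
    """Return per-token deltas with small changes zeroed out.

    tokens: list of equal-length lists of floats (one vector per token).
    The first token is kept in full. Each later token is represented by its
    difference from a running reference; entries whose absolute difference
    is below `threshold` are dropped (set to 0.0).
    Assumption: the reference is updated only where a delta was kept.
    """
    reference = list(tokens[0])
    deltas = [list(tokens[0])]
    for token in tokens[1:]:
        delta = []
        for i, value in enumerate(token):
            diff = value - reference[i]
            if abs(diff) >= threshold:
                delta.append(diff)
                reference[i] = value
            else:
                delta.append(0.0)
        deltas.append(delta)
    return deltas


def sparsity(deltas):
    """Fraction of zero entries, a rough proxy for skipped operations."""
    total = sum(len(d) for d in deltas)
    zeros = sum(1 for d in deltas for v in d if v == 0.0)
    return zeros / total


# Hypothetical toy sequence of three 3-dimensional tokens.
tokens = [[1.0, 2.0, 3.0], [1.05, 2.0, 3.5], [1.1, 2.02, 3.5]]
deltas = delta_prune(tokens, threshold=0.2)
```

In this toy sequence only one entry after the first token survives the threshold, so most of the later work would operate on zeros.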

Preprint, 2022, arXiv

GRAPHOPT: constrained-optimization-based parallelization of irregular graphs

Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern platforms remains challenging due to the irregular data dependencies. The execution performance can be improved by efficiently partitioning the graphs such that the communication and thread synchronization overheads are minimized without hurting the utilization of the threads. To achieve this, this paper proposes GRAPHOPT, a tool that models the graph parallelization as a constrained optimization problem and uses the open Google OR-Tools solver to find good partitions. Several scalability techniques are developed to handle large real-world graphs with millions of nodes and edges. Extensive experiments are performed on the graphs of sparse matrix triangular solves (linear algebra) and sum-product networks (machine learning), respectively, showing a mean speedup of 2.0X and 1.8X over previous state-of-the-art libraries, demonstrating the effectiveness of the constrained-optimization-based graph parallelization.
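To make the partitioning objective concrete, here is a small sketch that treats partitioning as a constrained optimization problem. It does not use OR-Tools and is not GRAPHOPT's formulation: an exhaustive search stands in for the solver, the objective only counts edges that cross partitions, and the balance constraint and function name are assumptions. It is only practical for very small graphs.

```python
from itertools import product


def best_partition(num_nodes, edges, num_parts=2, max_imbalance=1):
    """Exhaustively search node-to-partition assignments.

    Constraint: partition sizes may differ by at most `max_imbalance`.
    Objective: minimize the number of edges whose endpoints are in
    different partitions (a stand-in for communication overhead).
    """
    best_cost, best_assign = None, None
    for assign in product(range(num_parts), repeat=num_nodes):
        sizes = [assign.count(p) for p in range(num_parts)]
        if max(sizes) - min(sizes) > max_imbalance:
            continue  # violates the balance constraint
        cost = sum(1 for u, v in edges if assign[u] != assign[v])
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign


# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cost, assign = best_partition(6, edges)
```

The search places each triangle in its own partition, so only the joining edge is cut. A real solver replaces the exhaustive loop and can handle far larger graphs.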

Preprint, 2022, arXiv

Hardware-aware mobile building block evaluation for computer vision

In this work we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses Pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach matches the information obtained by previous comparison paradigms, but provides more insight into the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of 2x on specific hardware ML accelerators.
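A Pareto front over sampled networks can be computed as in the sketch below. The sample values are invented for illustration, and the pairing of one hardware cost figure (for example latency) with accuracy is an assumption about how such a comparison might be set up, not the paper's exact procedure.

```python
def pareto_front(points):
    """Return the non-dominated (cost, accuracy) points.

    Lower cost and higher accuracy are better. Points are scanned in order
    of increasing cost; a point is kept only if it is more accurate than
    every cheaper (or equally cheap) point seen so far.
    """
    front = []
    best_acc = float("-inf")
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            front.append((cost, acc))
            best_acc = acc
    return front


# Hypothetical (hardware cost, accuracy) pairs for randomly sampled networks.
samples = [
    (10.0, 0.70), (12.0, 0.68), (15.0, 0.75),
    (15.0, 0.74), (20.0, 0.75), (25.0, 0.80),
]
front = pareto_front(samples)
```

Comparing the fronts obtained for two building blocks on the same hardware shows which block gives better accuracy at each cost level.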

Preprint, 2022, arXiv

Taxonomy and Benchmarking of Precision-Scalable MAC Arrays Under Enhanced DNN Dataflow Representation

Reduced-precision and variable-precision multiply-accumulate (MAC) operations provide opportunities to significantly improve energy efficiency and throughput of DNN accelerators with no/limited algorithmic performance loss, paving the way towards deploying AI applications on resource-constrained edge devices. Accordingly, various precision-scalable MAC array (PSMA) architectures were proposed recently. However, it is difficult to make a fair comparison between those alternatives, as each proposed PSMA is demonstrated in different systems and technologies. This work aims to provide a clear view of the design space of PSMA and offer insights for selecting the optimal architectures based on designers' needs. First, we introduce a precision-enhanced for-loop representation for DNN dataflows. Next, we use this new representation to build a comprehensive PSMA taxonomy, capable of systematically covering most prominent state-of-the-art PSMAs, as well as uncovering new PSMA architectures. Following that, we build a highly parameterized PSMA template that can be design-time configured into a huge subset of the design space spanned by the taxonomy. This allows a fair and thorough benchmark of 72 different PSMA architectures. We perform such studies in 28nm technology targeting run-time precision scalability from 8 to 2 bits, operating at 200 MHz and 1 GHz. Analyzing resulting energy and area breakdowns reveals key design guidelines for PSMA architectures.
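One common way to obtain precision scalability between 8 and 2 bits is to build a wide multiplication from 2-bit sub-multiplications combined with shifts and adds. The sketch below illustrates that arithmetic for unsigned operands only. It is a generic illustration, not one of the architectures from the paper's taxonomy, and the function names are assumptions.

```python
def split_2bit(value, bits=8):
    """Split an unsigned integer into 2-bit digits, least significant first."""
    return [(value >> shift) & 0b11 for shift in range(0, bits, 2)]


def mul_from_2bit_units(a, b, bits=8):
    """Multiply two unsigned `bits`-wide integers using only 2-bit products.

    Each pair of 2-bit digits is multiplied by a small unit, and the partial
    product is shifted to its weight before being accumulated.
    """
    total = 0
    for i, a_digit in enumerate(split_2bit(a, bits)):
        for j, b_digit in enumerate(split_2bit(b, bits)):
            total += (a_digit * b_digit) << (2 * (i + j))
    return total


def mac_2bit(pairs):
    """Low-precision mode: the same kind of unit performs independent
    2-bit by 2-bit products, which are accumulated without shifting."""
    return sum((a & 0b11) * (b & 0b11) for a, b in pairs)


product_8bit = mul_from_2bit_units(200, 123)
dot_2bit = mac_2bit([(3, 2), (1, 3), (2, 2)])
```

An 8-bit by 8-bit product needs sixteen 2-bit sub-products in this scheme, which is why the same hardware budget can deliver many more operations per cycle at 2-bit precision.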

Preprint, 2021, arXiv

Acceleration of probabilistic reasoning through custom processor architecture

Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.
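For readers unfamiliar with the kernel being accelerated, the sketch below evaluates a tiny sum-product network bottom-up. The tuple-based node encoding, the function name, and the example network are assumptions made for illustration. The irregular tree of small sums and products is the kind of workload the abstract refers to.

```python
def evaluate_spn(node, evidence):
    """Evaluate a sum-product network for the given evidence.

    Node encodings (hypothetical, for this sketch only):
      ("leaf", variable, value)          indicator leaf
      ("product", [children])            product node
      ("sum", [(weight, child), ...])    weighted sum node
    Unobserved variables are marginalized by setting their indicators to 1.
    """
    kind = node[0]
    if kind == "leaf":
        _, variable, value = node
        observed = evidence.get(variable)
        return 1.0 if observed is None or observed == value else 0.0
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= evaluate_spn(child, evidence)
        return result
    if kind == "sum":
        return sum(w * evaluate_spn(c, evidence) for w, c in node[1])
    raise ValueError(f"unknown node type: {kind}")


def bernoulli(variable, p_true):
    """Sum node encoding P(variable = 1) = p_true."""
    return ("sum", [(p_true, ("leaf", variable, 1)),
                    (1.0 - p_true, ("leaf", variable, 0))])


# Mixture of two components, each a product of independent X and Y.
spn = ("sum", [
    (0.6, ("product", [bernoulli("X", 0.9), bernoulli("Y", 0.2)])),
    (0.4, ("product", [bernoulli("X", 0.1), bernoulli("Y", 0.7)])),
])

p_joint = evaluate_spn(spn, {"X": 1, "Y": 1})  # P(X=1, Y=1)
p_marginal = evaluate_spn(spn, {"X": 1})       # P(X=1), Y marginalized
```

A single pass over the network answers both joint and marginal queries, which is what makes the model attractive for embedded inference.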

Preprint, 2021, arXiv

Benchmarking TinyML Systems: Challenges and Direction

Recent advancements in ultra-low-power machine learning (TinyML) hardware promise to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted benchmark for these systems. Benchmarking allows us to measure and thereby systematically compare, evaluate, and improve the performance of systems and is therefore fundamental to a field reaching maturity. In this position paper, we present the current landscape of TinyML and discuss the challenges and direction towards developing a fair and useful hardware benchmark for TinyML workloads. Furthermore, we present our four benchmarks and discuss our selection methodology. Our viewpoints reflect the collective thoughts of the TinyMLPerf working group, which comprises over 30 organizations.

Preprint, 2021, arXiv

ProbLP: A framework for low-precision probabilistic inference

Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
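The link between a worst-case error bound and the choice of representation can be shown for the simplest case, a single value rounded to fixed point. This sketch is not ProbLP's analysis, which the abstract describes as covering whole inference networks and energy models; the function names and the tolerance are assumptions. Rounding to the nearest multiple of 2^-f introduces an error of at most 2^-(f+1), so the smallest sufficient f can be found directly.

```python
def quantize(value, frac_bits):
    """Round `value` to the nearest fixed-point number with `frac_bits`
    fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def min_frac_bits(tolerance, max_bits=32):
    """Smallest number of fractional bits whose worst-case rounding error,
    2 ** -(frac_bits + 1), does not exceed `tolerance`."""
    for frac_bits in range(1, max_bits + 1):
        if 2.0 ** -(frac_bits + 1) <= tolerance:
            return frac_bits
    raise ValueError("tolerance too small for max_bits")


# Hypothetical requirement: each stored probability within 0.01 of its value.
bits = min_frac_bits(0.01)
stored = quantize(0.3, bits)
```

A full analysis would also track how such rounding errors grow through the additions and multiplications of the inference network.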

Preprint, 2020, arXiv

Feed-Forward On-Edge Fine-tuning Using Static Synthetic Gradient Modules

Training deep learning models on embedded devices is typically avoided since this requires more memory, computation and power than inference. In this work, we focus on lowering the amount of memory needed for storing all activations, which are required during the backward pass to compute the gradients. Instead, during the forward pass, static Synthetic Gradient Modules (SGMs) predict gradients for each layer. This allows training the model in a feed-forward manner without having to store all activations. We tested our method on a robot grasping scenario where a robot needs to learn to grasp new objects given only a single demonstration. By first training the SGMs in a meta-learning manner on a set of common objects, during fine-tuning, the SGMs provided the model with accurate gradients to successfully learn to grasp new objects. We have shown that our method has comparable results to using standard backpropagation.

Preprint, 2020, arXiv

ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework

Building efficient embedded deep learning systems requires a tight co-design between DNN algorithms, memory hierarchy, and dataflow. However, owing to the large degrees of freedom in the design space, finding an optimal solution through the implementation of individual design points becomes infeasible. Recently, several estimation frameworks for fast design space exploration (DSE) have emerged, yet they either suffer from long runtimes or a limited exploration space. This work introduces ZigZag, a memory-centric rapid DNN accelerator DSE framework which extends the DSE with uneven mapping opportunities, in which operands at shared memory levels are no longer bound to use the same memory levels for each loop index. For this, ZigZag uses a memory-centric nested-for-loop format as a uniform representation to integrate algorithm, accelerator, and algorithm-to-accelerator mapping, and consists of three key components: 1) a latency-enhanced analytical Hardware Cost Estimator, 2) a Temporal Mapping Generator that supports even/uneven scheduling on any type of memory hierarchy, and 3) an Architecture Generator that explores the whole memory hierarchy design space. Benchmarking experiments against existing frameworks, together with three case studies at different design abstraction levels show the strength of ZigZag. Up to 33% more energy-efficient solutions are found by introducing ZigZag's uneven scheduling opportunities.