Source author record

Francky Catthoor

Francky Catthoor appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Hardware Architecture · Neural and Evolutionary Computing · Emerging Technologies · Computer Vision · eess.SY · Logic in Computer Science · Networking and Internet Architecture · Programming Languages · Robotics · Systems and Control

Catalog footprint

What is connected

15 works

10 topics

4 close collaborators

preprint · 2026 · arXiv

Architectural Classification of XR Workloads: Cross-Layer Archetypes and Implications

Edge and mobile platforms for augmented and virtual reality, collectively referred to as extended reality (XR), must deliver deterministic ultra-low-latency performance under stringent power and area constraints. However, the diversity of XR workloads is rapidly increasing, characterized by heterogeneous operator types and complex dataflow structures. This trend poses significant challenges to conventional accelerator architectures centered around convolutional neural networks (CNNs), resulting in diminishing returns for traditional compute-centric optimization strategies. Despite the importance of this problem, a systematic architectural understanding of the full XR pipeline remains lacking. In this paper, we present an architectural classification of XR workloads using a cross-layer methodology that integrates model-based high-level design space exploration (DSE) with empirical profiling on commercial GPU and CPU hardware. By analyzing a representative set of workloads spanning 12 distinct XR kernels, we distill their complex architectural characteristics into a small set of cross-layer workload archetypes (e.g., capacity-limited and overhead-sensitive). Building on these archetypes, we further extract key architectural insights and provide actionable design guidelines for next-generation XR SoCs. Our study highlights that XR architecture design must shift from generic resource scaling toward phase-aware scheduling and elastic resource allocation in order to achieve greater energy efficiency and high performance in future XR systems.

preprint · 2023 · arXiv

Investigating methods to improve photovoltaic thermal models at second-to-minute timescales

This paper presents a range of methods to improve the accuracy of equation-based thermal models of PV modules at second-to-minute timescales. We present an RC-equivalent conceptual model for PV modules, where wind effects are captured. We show how the thermal time constant $\tau$ of PV modules can be determined from measured data, and subsequently used to make static thermal models dynamic by applying the Exponential Weighted Mean (EWM) approach to irradiance and wind signals. On average, $\tau$ is $6.3 \pm 1$ min for fixed-mount PV systems. Based on this conceptual model, the Filter-EWM-Mean Bias Error correction (FEM) methodology is developed. We propose two thermal models, WM1 and WM2, and compare these against the models of Ross, Sandia, and Faiman on twenty-four datasets of fifteen sites, with time resolutions ranging from 1 s to 1 h, the majority of these at 1 min resolution. The FEM methodology is shown to reduce model errors (RMSE and MAE) on average for all sites and models versus the standard steady-state equivalent by $-1.1$ K and $-0.75$ K, respectively.
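
The idea of making a static thermal model dynamic by smoothing its inputs can be sketched as follows. This is a minimal illustration, not the FEM methodology or the WM1/WM2 models from the paper: it assumes a plain first-order exponential filter with smoothing factor 1 - exp(-dt/tau) and feeds the smoothed irradiance into the Ross model (module temperature = ambient temperature + k times irradiance). The function names `ewm_filter` and `ross_temperature` and the coefficient value `k=0.03` are hypothetical placeholders.

```python
import math

def ewm_filter(signal, dt_s, tau_s):
    """Exponentially weighted mean of a uniformly sampled signal.

    Assumption: a first-order filter whose smoothing factor follows
    from the sampling step dt_s and the thermal time constant tau_s.
    """
    alpha = 1.0 - math.exp(-dt_s / tau_s)
    state = signal[0]          # start from the first sample
    smoothed = []
    for sample in signal:
        state += alpha * (sample - state)
        smoothed.append(state)
    return smoothed

def ross_temperature(t_ambient, irradiance, k=0.03):
    """Steady-state Ross model: T_module = T_ambient + k * G.

    k (in K m^2 / W) is an illustrative placeholder, not a fitted value.
    """
    return [ta + k * g for ta, g in zip(t_ambient, irradiance)]

# Irradiance step from 0 to 1000 W/m^2, sampled every 63 s, tau = 378 s (6.3 min).
irradiance = [0.0] + [1000.0] * 10
ambient = [20.0] * len(irradiance)

static_temp = ross_temperature(ambient, irradiance)
dynamic_temp = ross_temperature(ambient, ewm_filter(irradiance, 63.0, 378.0))
```

With these inputs the static model jumps to its final temperature in one sample, while the filtered version approaches it gradually and has covered a fraction 1 - 1/e of the step after one time constant.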

preprint · 2023 · arXiv

Timescales: the choreography of classical and unconventional computing

Tasks that one wishes to have done by a computer often come with conditions that relate to timescales. For instance, the processing must terminate within a given time limit; or a signal processing computer must integrate input information across several timescales; or a robot motor controller must react to sensor signals with a short latency. For classical digital computing machines such requirements pose no fundamental problems as long as the physical clock rate is fast enough for the fastest relevant timescales. However, when digital microchips become scaled down into the few-nanometer range where quantum noise and device mismatch become unavoidable, or when the digital computing paradigm is altogether abandoned in analog neuromorphic systems or other unconventional hardware bases, it can become difficult to relate timescale conditions to the physical hardware dynamics. Here we explore the relations between task-defined timescale conditions and physical hardware timescales in some depth. The article has two main parts. In the first part we develop an abstract model of a generic computational system that admits a unified discussion of computational timescales. This model is general enough to cover digital computing systems as well as analog neuromorphic or other unconventional ones. We identify four major types of timescales which require separate considerations: causal physical timescales; timescales of phenomenal change which characterize the "speed" of how something changes in time; timescales of reactivity which describe how fast a computing system can react to incoming trigger information; and memory timescales. In the second part we survey twenty known computational mechanisms that can be used to obtain desired task-related timescale characteristics from the physical givens of the available hardware.

preprint · 2022 · arXiv

A New Look at Spike-Timing-Dependent Plasticity Networks for Spatio-Temporal Feature Learning

We present new theoretical foundations for unsupervised Spike-Timing-Dependent Plasticity (STDP) learning in spiking neural networks (SNNs). In contrast to empirical parameter search used in most previous works, we provide novel theoretical grounds for SNN and STDP parameter tuning which considerably reduces design time. Using our generic framework, we propose a class of global, action-based and convolutional SNN-STDP architectures for learning spatio-temporal features from event-based cameras. We assess our methods on the N-MNIST, the CIFAR10-DVS and the IBM DVS128 Gesture datasets, all acquired with a real-world event camera. Using our framework, we report significant improvements in classification accuracy compared to both conventional state-of-the-art event-based feature descriptors (+8.2% on CIFAR10-DVS), and compared to state-of-the-art STDP-based systems (+9.3% on N-MNIST, +7.74% on IBM DVS128 Gesture). Our work contributes to both ultra-low-power learning in neuromorphic edge devices, and towards a biologically-plausible, optimization-based theory of cortical vision.

preprint · 2022 · arXiv

Learning to SLAM on the Fly in Unknown Environments: A Continual Learning Approach for Drones in Visually Ambiguous Scenes

Learning to safely navigate in unknown environments is an important task for autonomous drones used in surveillance and rescue operations. In recent years, a number of learning-based Simultaneous Localisation and Mapping (SLAM) systems relying on deep neural networks (DNNs) have been proposed for applications where conventional feature descriptors do not perform well. However, such learning-based SLAM systems rely on DNN feature encoders trained offline in typical deep learning settings. This makes them less suited for drones deployed in environments unseen during training, where continual adaptation is paramount. In this paper, we present a new method for learning to SLAM on the fly in unknown environments, by modulating a low-complexity Dictionary Learning and Sparse Coding (DLSC) pipeline with a newly proposed Quadratic Bayesian Surprise (QBS) factor. We experimentally validate our approach with data collected by a drone in a challenging warehouse scenario, where the high number of ambiguous scenes makes visual disambiguation hard.

preprint · 2022 · arXiv

VWR2A: A Very-Wide-Register Reconfigurable-Array Architecture for Low-Power Embedded Devices

Edge computing requires high-performance energy-efficient embedded systems. Fixed-function or custom accelerators, such as FFT or FIR filter engines, are very efficient at implementing a particular functionality for a given set of constraints. However, they are inflexible when facing application-wide optimizations or functionality upgrades. Conversely, programmable cores offer higher flexibility, but often with a penalty in area, performance, and, above all, energy consumption. In this paper, we propose VWR2A, an architecture that integrates high computational density and low power memory structures (i.e., very-wide registers and scratchpad memories). VWR2A narrows the energy gap with similar or better performance on FFT kernels with respect to an FFT accelerator. Moreover, VWR2A's flexibility allows multiple kernels to be accelerated, resulting in significant energy savings at the application level.

preprint · 2021 · arXiv

MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration

Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm technology node. We observe a performance gain of 9.1% when running a matrix multiplication on the MemPool-3D design with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15% smaller than its 2D counterpart, and even 3.7% smaller than the MemPool-2D instance with one-fourth of the L1 scratchpad memory capacity.

preprint · 2020 · arXiv

Enabling Resource-Aware Mapping of Spiking Neural Networks via Spatial Decomposition

With growing model complexity, mapping Spiking Neural Network (SNN)-based applications to tile-based neuromorphic hardware is becoming increasingly challenging. This is because the synaptic storage resources on a tile, viz. a crossbar, can accommodate only a fixed number of pre-synaptic connections per post-synaptic neuron. For complex SNN models that have many pre-synaptic connections per neuron, some connections may need to be pruned after training to fit onto the tile resources, leading to a loss in model quality, e.g., accuracy. In this work, we propose a novel unrolling technique that decomposes a neuron function with many pre-synaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node, with two pre-synaptic connections. This spatial decomposition technique significantly improves crossbar utilization and retains all pre-synaptic connections, resulting in no loss of the model quality derived from connection pruning. We integrate the proposed technique within an existing SNN mapping framework and evaluate it using machine learning applications on the DYNAP-SE state-of-the-art neuromorphic hardware. Our results demonstrate an average 60% lower crossbar requirement, 9x higher synapse utilization, 62% lower wasted energy on the hardware, and between 0.8% and 4.6% increase in model quality.
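
The unrolling described above can be illustrated with a small sketch. It is a deliberate simplification, not the authors' technique: it treats the neuron as a plain weighted sum (no spiking dynamics, thresholds, or crossbar model) and chains two-input units so that every input connection is kept. The names `unroll` and `evaluate` and the tuple encoding of units are hypothetical.

```python
def unroll(num_inputs):
    """Decompose a fan-in of num_inputs (>= 2) into a chain of two-input units.

    Each unit is a pair of sources; a source is ("in", i) for original
    input i, or ("unit", j) for the output of an earlier unit j.
    """
    units = [(("in", 0), ("in", 1))]
    for i in range(2, num_inputs):
        units.append((("unit", len(units) - 1), ("in", i)))
    return units

def evaluate(units, weights, inputs):
    """Evaluate the chain; original inputs are weighted, unit outputs pass through."""
    outputs = []
    for sources in units:
        total = 0.0
        for kind, index in sources:
            if kind == "unit":
                total += outputs[index]
            else:
                total += weights[index] * inputs[index]
        outputs.append(total)
    return outputs[-1]

weights = [0.5, -1.0, 2.0, 0.25]
inputs = [1, 0, 1, 1]
units = unroll(len(weights))
chained = evaluate(units, weights, inputs)
direct = sum(w * x for w, x in zip(weights, inputs))
```

For a fan-in of N the chain has N - 1 units, each with exactly two incoming connections, and in this linear setting `chained` matches `direct`, which mirrors the claim that no connection has to be pruned.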

preprint · 2020 · arXiv

PyCARL: A PyNN Interface for Hardware-Software Co-Simulation of Spiking Neural Network

We present PyCARL, a PyNN-based common Python programming interface for hardware-software co-simulation of spiking neural networks (SNNs). Through PyCARL, we make the following two key contributions. First, we provide an interface of PyNN to CARLsim, a computationally-efficient, GPU-accelerated and biophysically-detailed SNN simulator. PyCARL facilitates joint development of machine learning models and code sharing between CARLsim and PyNN users, promoting an integrated and larger neuromorphic community. Second, we integrate cycle-accurate models of state-of-the-art neuromorphic hardware such as TrueNorth, Loihi, and DynapSE in PyCARL, to accurately model hardware latencies that delay spikes between communicating neurons and degrade performance. PyCARL allows users to analyze and optimize the performance difference between software-only simulation and hardware-software co-simulation of their machine learning models. We show that system designers can also use PyCARL to perform design-space exploration early in the product development stage, facilitating faster time-to-deployment of neuromorphic products. We evaluate the memory usage and simulation time of PyCARL using functionality tests, synthetic SNNs, and realistic applications. Our results demonstrate that for large SNNs, PyCARL does not lead to any significant overhead compared to CARLsim. We also use PyCARL to analyze these SNNs for state-of-the-art neuromorphic hardware and demonstrate a significant performance deviation from software-only simulations. PyCARL allows such differences to be evaluated and minimized early during model development.

preprint · 2020 · arXiv

Run-time Mapping of Spiking Neural Networks to Neuromorphic Hardware

In this paper, we propose a design methodology to partition and map the neurons and synapses of online learning SNN-based applications to neuromorphic architectures at run-time. Our design methodology operates in two steps -- step 1 is a layer-wise greedy approach to partition SNNs into clusters of neurons and synapses incorporating the constraints of the neuromorphic architecture, and step 2 is a hill-climbing optimization algorithm that minimizes the total spikes communicated between clusters, improving energy consumption on the shared interconnect of the architecture. We conduct experiments to evaluate the feasibility of our algorithm using synthetic and realistic SNN-based applications. We demonstrate that our algorithm reduces SNN mapping time by an average 780x compared to a state-of-the-art design-time based SNN partitioning approach with only 6.25% lower solution quality.
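
The two steps named above can be sketched in a few lines. This is a toy version under stated assumptions, not the authors' algorithm: the only hardware constraint modeled is a fixed number of neurons per cluster, the greedy step simply fills clusters in layer order, and the hill-climbing step swaps neuron pairs between clusters while that lowers the inter-cluster spike count. All names (`greedy_partition`, `inter_cluster_spikes`, `hill_climb`) and the example spike counts are hypothetical.

```python
def greedy_partition(layers, capacity):
    """Step 1 (sketch): fill clusters layer by layer, opening a new one when full."""
    assignment, cluster, used = {}, 0, 0
    for layer in layers:
        for neuron in layer:
            if used == capacity:
                cluster, used = cluster + 1, 0
            assignment[neuron] = cluster
            used += 1
    return assignment

def inter_cluster_spikes(assignment, spikes):
    """Total spikes on synapses whose endpoints sit in different clusters."""
    return sum(count for (src, dst), count in spikes.items()
               if assignment[src] != assignment[dst])

def hill_climb(assignment, spikes):
    """Step 2 (sketch): accept pairwise swaps that strictly reduce the cost.

    Swapping keeps every cluster at its original size, so the capacity
    constraint from step 1 is preserved.
    """
    assignment = dict(assignment)
    best = inter_cluster_spikes(assignment, spikes)
    neurons = list(assignment)
    improved = True
    while improved:
        improved = False
        for i, a in enumerate(neurons):
            for b in neurons[i + 1:]:
                if assignment[a] == assignment[b]:
                    continue
                assignment[a], assignment[b] = assignment[b], assignment[a]
                cost = inter_cluster_spikes(assignment, spikes)
                if cost < best:
                    best, improved = cost, True
                else:  # undo a swap that did not help
                    assignment[a], assignment[b] = assignment[b], assignment[a]
    return assignment, best

layers = [["a", "b"], ["c", "d"]]
spikes = {("a", "c"): 5, ("b", "d"): 5, ("a", "b"): 1}
initial = greedy_partition(layers, capacity=2)
initial_cost = inter_cluster_spikes(initial, spikes)
final, final_cost = hill_climb(initial, spikes)
```

In this example the layer-wise split separates both heavily communicating pairs, and the swap search regroups them so that only the light `a`-`b` connection crosses clusters. Like any hill climber, it can stop in a local minimum on larger inputs.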

preprint · 2014 · arXiv

Worst-case Throughput Analysis for Parametric Rate and Parametric Actor Execution Time Scenario-Aware Dataflow Graphs

Scenario-aware dataflow (SADF) is a prominent tool for modeling and analysis of dynamic embedded dataflow applications. In SADF the application is represented as a finite collection of synchronous dataflow (SDF) graphs, each of which represents one possible application behaviour or scenario. A finite state machine (FSM) specifies the possible orders of scenario occurrences. The SADF model renders the tightest possible performance guarantees, but is limited by its finiteness. This means that from a practical point of view, it can only handle dynamic dataflow applications that are characterized by a reasonably sized set of possible behaviours or scenarios. In this paper we remove this limitation for a class of SADF graphs by means of SADF model parametrization in terms of graph port rates and actor execution times. First, we formally define the semantics of the model relevant for throughput analysis based on (max,+) linear system theory and (max,+) automata. Second, by generalizing some of the existing results, we give the algorithms for worst-case throughput analysis of parametric rate and parametric actor execution time acyclic SADF graphs with a fully connected, possibly infinite state transition system. Third, we demonstrate our approach on a few realistic applications from the digital signal processing (DSP) domain mapped onto an embedded multi-processor architecture.
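
The (max,+) view mentioned above can be made concrete with a small sketch. In (max,+) algebra, "addition" is max and "multiplication" is ordinary +, so one iteration of a scenario maps a vector of token time stamps to a new vector through a matrix-vector product. The code below only shows that basic operation for a fixed scenario sequence; it is not the parametric worst-case analysis of the paper, and the two 2x2 scenario matrices are invented for illustration.

```python
NEG_INF = float("-inf")  # the (max,+) "zero": no dependency between two tokens

def maxplus_matvec(matrix, vector):
    """(max,+) product: result[i] = max over j of (matrix[i][j] + vector[j])."""
    return [max(m + v for m, v in zip(row, vector)) for row in matrix]

def run_scenarios(matrices, sequence, start):
    """Apply the matrix of each scenario in the given order to the time stamps."""
    state = list(start)
    for name in sequence:
        state = maxplus_matvec(matrices[name], state)
    return state

# Hypothetical scenario matrices; entries are delays between tokens.
matrices = {
    "a": [[2, NEG_INF], [3, 1]],
    "b": [[1, 4], [NEG_INF, 2]],
}
stamps = run_scenarios(matrices, ["a", "b"], [0, 0])  # -> [7, 5]
```

Starting from time stamps `[0, 0]`, scenario `a` yields `[2, 3]` and scenario `b` then yields `[7, 5]`. Throughput analysis asks how fast such time stamps can grow over all scenario sequences the FSM allows.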

preprint · 2007 · arXiv

A Hybrid Prefetch Scheduling Heuristic to Minimize at Run-Time the Reconfiguration Overhead of Dynamically Reconfigurable Hardware

Due to the emergence of highly dynamic multimedia applications there is a need for flexible platforms and run-time scheduling support for embedded systems. Dynamic Reconfigurable Hardware (DRHW) is a promising candidate to provide this flexibility but, currently, sufficient run-time scheduling support to deal with the run-time reconfigurations does not exist. Moreover, executing at run-time a complex scheduling heuristic to provide this support may generate an excessive run-time penalty. Hence, we have developed a hybrid design/run-time prefetch heuristic that schedules the reconfigurations at run-time, but carries out the scheduling computations at design-time by carefully identifying a set of near-optimal schedules that can be selected at run-time. This approach provides run-time flexibility with a negligible penalty.

preprint · 2007 · arXiv

A Memory Hierarchical Layer Assigning and Prefetching Technique to Overcome the Memory Performance/Energy Bottleneck

The memory subsystem has always been a bottleneck in performance as well as a significant power contributor in memory intensive applications. Many researchers have presented multi-layered memory hierarchies as a means to design energy and performance efficient systems. However, most of the previous work does not explore trade-offs systematically. We fill this gap by proposing a formalized technique that takes into consideration data reuse, limited lifetime of the arrays of an application and application specific prefetching opportunities, and performs a thorough trade-off exploration for different memory layer sizes. This technique has been implemented on a prototype tool, which was tested successfully using nine real-life applications of industrial relevance. Following this approach we have been able to reduce execution time by up to 60%, and energy consumption by up to 70%.

preprint · 2007 · arXiv

Energy Efficiency of the IEEE 802.15.4 Standard in Dense Wireless Microsensor Networks: Modeling and Improvement Perspectives

Wireless microsensor networks, which have been the topic of intensive research in recent years, are now emerging in industrial applications. An important milestone in this transition has been the release of the IEEE 802.15.4 standard that specifies interoperable wireless physical and medium access control layers targeted to sensor node radios. In this paper, we evaluate the potential of an 802.15.4 radio for use in an ultra low power sensor node operating in a dense network. Starting from measurements carried out on the off-the-shelf radio, effective radio activation and link adaptation policies are derived. It is shown that, in a typical sensor network scenario, the average power per node can be reduced down to 211 µW. Next, the energy consumption breakdown between the different phases of a packet transmission is presented, indicating which part of the transceiver architecture can most effectively be optimized in order to further reduce the radio power, enabling self-powered wireless microsensor networks.

preprint · 2007 · arXiv

Functional Equivalence Checking for Verification of Algebraic Transformations on Array-Intensive Source Code

Development of energy and performance-efficient embedded software is increasingly relying on application of complex transformations on the critical parts of the source code. Designers applying such nontrivial source code transformations are often faced with the problem of ensuring functional equivalence of the original and transformed programs. Currently they have to rely on incomplete and time-consuming simulation. Formal automatic verification of the transformed program against the original is instead desirable. This calls for equivalence checking tools similar to the ones available for comparing digital circuits. We present such a tool to compare array-intensive programs related through a combination of important global transformations like expression propagations, loop and algebraic transformations. When the transformed program fails to pass the equivalence check, the tool provides specific feedback on the possible locations of errors.
