Source author record

Luigi Raffo

Luigi Raffo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Hardware Architecture eess.SP

Catalog footprint

What is connected

3works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

Run-time Performance Monitoring of Heterogenous Hw/Sw Platforms Using PAPI

In the era of Cyber Physical Systems, designers need to offer support for run-time adaptivity considering different constraints, including the internal status of the system. This work presents a run-time monitoring approach, based on the Performance Application Programming Interface, that offers a unified interface to transparently access both the standard Performance Monitoring Counters (PMCs) in the CPUs and the custom ones integrated into hardware accelerators. Automatic tools offer to Sw programmers the support to design and implement Coarse-Grain Virtual Reconfigurable Circuits, instrumented with custom PMCs. This approach has been validated on a heterogeneous application for image/video processing with an overhead of 6% of the execution time.

preprint2021arXiv

The Multi-Dataflow Composer Tool: an open-source tool suite for Optimized Coarse-Grain Reconfigurable Hardware Accelerators and Platform Design

Modern embedded and cyber-physical systems require every day more performance, power efficiency and flexibility, to execute several profiles and functionalities targeting the ever growing adaptivity needs and preserving execution efficiency. Such requirements pushed designers towards the adoption of heterogeneous and reconfigurable substrates, which development and management is not that straightforward. Despite acceleration and flexibility are desirable in many domains, the barrier of hardware deployment and operation is still there since specific advanced expertise and skills are needed. Related challenges are effectively tackled by leveraging on automation strategies that in some cases, as in the proposed work, exploit model-based approaches. This paper is focused on the Multi-Dataflow Composer (MDC) tool, that intends to solve issues related to design, optimization and operation of coarse-grain reconfigurable hardware accelerators and their easy adoption in modern heterogeneous substrates. MDC latest features and improvements are introduced in detail and have been assessed on the so far unexplored robotics application field. A multi-profile trajectory generator for a robotic arm is implemented over a Xilinx FPGA board to show in which cases coarse-grain reconfiguration can be applied and which can be the parameters and trade-offs MDC will allow users to play with.

preprint2020arXiv

Optimizing Temporal Convolutional Network inference on FPGA-based accelerators

Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multilayer(deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time series and sequences classification and segmentation, as well as in tasks involving sequence modelling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), have been demonstrated to consistently outperform Recurrent Neural Networks in terms of accuracy and training time [1]. While FPGA-based inference accelerators for classic CNNs are widespread, literature is lacking in a quantitative evaluation of their usability on inference for TCN models. In this paper we present such an evaluation, considering a CNN accelerator with specific features supporting TCN kernels as a reference and a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for the overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall we can achieve up to 111,8 GOPS/s and power efficiency of 33,9 GOPS/s/W on an Ultrascale+ ZU3EG (up to 10x speedup and 3x power efficiency improvement with respect to pure software implementation).