Researcher profile

Francesco Conti

Francesco Conti contributes to research discovery and scholarly infrastructure.

Researcher, Affiliation not imported, Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging, Verification L1, Unclaimed author
17 works
0 followers
13 topics
4 close collaborators

Published work

17 published items

preprint 2026 arXiv

An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning

Geometric and Topological Deep Learning are rapidly growing research areas that enhance machine learning through the use of geometric and topological structures. Within this framework, Group Equivariant Non-Expansive Operators (GENEOs) have emerged as a powerful class of operators for encoding symmetries and designing efficient, interpretable neural architectures. Originally introduced in Topological Data Analysis, GENEOs have since found applications in Deep Learning as tools for constructing equivariant models with reduced parameter complexity. GENEOs provide a unifying framework bridging Geometric and Topological Deep Learning and include the operator computing persistence diagrams as a special case. Their theoretical foundations rely on group actions, equivariance, and compactness properties of operator spaces, grounding them in algebra and geometry while enabling both mathematical rigor and practical relevance. While a previous representation theorem characterized linear GENEOs acting on data of the same type, many real-world applications require operators between heterogeneous data spaces. In this work, we address this limitation by introducing a new representation theorem for linear GENEOs acting between different perception pairs, based on generalized T-permutant measures. Under mild assumptions on the data domains and group actions, our result provides a complete characterization of such operators. We also prove the compactness and convexity of the space of linear GENEOs. We further demonstrate the practical impact of this theory by applying the proposed framework to improve the performance of autoencoders, highlighting the relevance of GENEOs in modern machine learning applications.

preprint 2026 arXiv

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
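
The abstract states only that Rescore "applies probability union rules" to aggregate confidence scores across frames. A minimal sketch of one such rule follows, assuming the per-frame confidences of a tracked object are treated as independent probabilities; the function name `union_score` is hypothetical, and the actual algorithm in the linked repository may differ.

```python
def union_score(scores):
    """Combine per-frame confidences p_i with the union rule for
    independent events: P = 1 - prod(1 - p_i).

    Assumption: the scores of one tracked object for one class are
    treated as independent probabilities. This is an illustration,
    not the authors' implementation.
    """
    miss = 1.0  # probability that no frame supports the class
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss


# Two frames at 0.5 each: 0.5 + 0.5 - 0.5 * 0.5 = 0.75
track_confidences = [0.5, 0.5]
aggregated = union_score(track_confidences)
```

Under this rule the aggregated score never falls below the highest single-frame score, which is one way repeated weak detections along a track can reinforce each other.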

preprint 2026 arXiv

TopoFisher: Learning Topological Summary Statistics by Maximizing Fisher Information

Persistence diagrams provide stable, interpretable summaries of geometric and topological structure and are useful for simulation-based inference when low-order statistics miss key information. Yet persistence-based pipelines require hand-chosen filtrations, vectorizations, and compressors, typically without an objective tied to parameter uncertainty. We introduce TopoFisher, a differentiable persistent-homology pipeline that learns topological summaries by maximizing local Gaussian Fisher information. Using simulations near a fiducial parameter, TopoFisher optimizes trainable filtrations, diagram vectorizations, and compressors without posterior samples or supervised regression targets, while retaining stable topological inductive bias. We also give sufficient regularity conditions for the log-determinant Fisher loss to be locally Lipschitz in trainable parameters. Controlled experiments on noisy spirals and Gaussian random fields, where total Fisher information is known, show that TopoFisher recovers much of the available information and outperforms fixed topological vectorizations. Our main results are on weak gravitational lensing, a high-dimensional non-Gaussian cosmological field-inference problem. Learned topological summaries reach higher Fisher information than state-of-the-art cosmological summaries and approach an unconstrained Information Maximising Neural Network baseline with up to ~80x fewer parameters. The learned filtrations also generalize better: under simulator shift from lognormal to LPT-based maps they retain most Fisher information, while the neural baseline drops, and in neural posterior estimation they give tighter constraints than the neural baseline and state-of-the-art cosmological summaries. These results support Fisher-based topological optimization as a robust, parameter-efficient front end for simulation-based inference.

preprint 2022 arXiv

A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End Inference of Real-World Deep Neural Networks

Deployment of modern TinyML tasks on small battery-constrained IoT devices requires high computational energy efficiency. Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference and serves as on-chip memory storage for DNN weights. However, IMC's functional flexibility limitations and their impact on performance, energy, and area efficiency are not yet fully understood at the system level. To target practical end-to-end IoT applications, IMC arrays must be enclosed in heterogeneous programmable systems, introducing new system-level challenges which we aim to address in this work. We present a heterogeneous tightly-coupled clustered architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators. We benchmark the system on a highly heterogeneous workload such as the Bottleneck layer from a MobileNetV2, showing 11.5x performance and 9.5x energy efficiency improvements, compared to highly optimized parallel execution on the cores. Furthermore, we explore the requirements for end-to-end inference of a full mobile-grade DNN (MobileNetV2) in terms of IMC array resources, by scaling up our heterogeneous architecture to a multi-array accelerator. Our results show that our solution, on the end-to-end inference of the MobileNetV2, is one order of magnitude better in terms of execution latency than existing programmable architectures and two orders of magnitude better than state-of-the-art heterogeneous solutions integrating in-memory computing analog cores.

preprint 2022 arXiv

GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors

The last few years have seen the emergence of IoT processors: ultra-low power systems-on-chips (SoCs) combining lightweight and flexible micro-controller units (MCUs), often based on open-ISA RISC-V cores, with application-specific accelerators to maximize performance and energy efficiency. Overall, this heterogeneity level requires complex hardware and a full-fledged software stack to orchestrate the execution and exploit platform features. For this reason, enabling agile design space exploration becomes a crucial asset for this new class of low-power SoCs. In this scenario, high-level simulators play an essential role in breaking the speed and design effort bottlenecks of cycle-accurate simulators and FPGA prototypes, respectively, while preserving functional and timing accuracy. We present GVSoC, a highly configurable and timing-accurate event-driven simulator that combines the efficiency of C++ models with the flexibility of Python configuration scripts. GVSoC is fully open-sourced, with the intent to drive future research in the area of highly parallel and heterogeneous RISC-V based IoT processors, leveraging three foundational features: Python-based modular configuration of the hardware description, easy calibration of platform parameters for accurate performance estimation, and high-speed simulation. Experimental results show that GVSoC enables practical functional and performance analysis and design exploration at the full-platform level (processors, memory, peripherals and IOs) with a speed-up of 2500x with respect to cycle-accurate simulation with errors typically below 10% for performance analysis.

preprint 2022 arXiv

Pruning In Time (PIT): A Lightweight Network Architecture Optimizer for Temporal Convolutional Networks

Temporal Convolutional Networks (TCNs) are promising Deep Learning models for time-series processing tasks. One key feature of TCNs is time-dilated convolution, whose optimization requires extensive experimentation. We propose an automatic dilation optimizer, which tackles the problem as weight pruning on the time axis and learns dilation factors together with weights in a single training run. Our method reduces the model size and inference latency on a real SoC hardware target by up to 7.4x and 3x, respectively, with no accuracy drop compared to a network without dilation. It also yields a rich set of Pareto-optimal TCNs starting from a single model, outperforming hand-designed solutions in both size and accuracy.
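
The idea of treating dilation as pruning on the time axis can be illustrated with a small sketch: zeroing all but every d-th tap of a dense causal kernel gives the same output as a shorter kernel applied with dilation d. The mask below is fixed by hand purely for illustration (in the paper it is learned during training), and the helper names `causal_conv` and `time_mask` are hypothetical.

```python
def causal_conv(x, w, dilation=1):
    """Causal 1D convolution: y[t] = sum_k w[k] * x[t - k * dilation],
    with samples before the start of x treated as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            idx = t - k * dilation
            if idx >= 0:
                acc += wk * x[idx]
        y.append(acc)
    return y


def time_mask(kernel_len, dilation):
    """Binary mask that keeps every `dilation`-th tap of a dense kernel."""
    return [1 if k % dilation == 0 else 0 for k in range(kernel_len)]


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.0, 4.0, -0.5]
w = [0.2, -0.1, 0.4, 0.3, -0.5, 0.1, 0.7, -0.3, 0.6]  # dense kernel, 9 taps

mask = time_mask(len(w), dilation=4)                 # keeps taps 0, 4 and 8
w_masked = [wk * m for wk, m in zip(w, mask)]        # pruned dense kernel
w_dilated = [wk for wk, m in zip(w, mask) if m]      # 3 surviving weights

y_masked = causal_conv(x, w_masked)                  # dense conv, masked taps
y_dilated = causal_conv(x, w_dilated, dilation=4)    # equivalent dilated conv
```

Both calls cover the same receptive field of 9 time steps, but the dilated form stores 3 weights instead of 9, which is where the size and latency savings would come from.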

preprint 2022 arXiv

RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy their latency, throughput, and precision requirements. While inference is achievable in practical cases, online finetuning and adaptation of general DL models are still highly challenging. One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100 mW extreme-edge SoCs. We tackle this problem with RedMulE (Reduced-precision matrix Multiplication Engine), a parametric low-power hardware accelerator for FP16 matrix multiplications - the main kernel of DL training and inference - conceived for tight integration within a cluster of tiny RISC-V cores based on the PULP (Parallel Ultra-Low-Power) architecture. In 22 nm technology, a 32-FMA RedMulE instance occupies just 0.07 mm^2 (14% of an 8-core RISC-V cluster) and achieves up to 666 MHz maximum operating frequency, for a throughput of 31.6 MAC/cycle (98.8% utilization). We reach a cluster-level power consumption of 43.5 mW and a full-cluster energy efficiency of 688 16-bit GFLOPS/W. Overall, RedMulE features up to 4.65x higher energy efficiency and 22x speedup over SW execution on 8 RISC-V cores.
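
The quoted utilization figure can be checked with a line of arithmetic. The sketch below assumes that each of the 32 FMA units can complete at most one MAC per cycle, so the peak is 32 MAC/cycle; that assumption and the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract.
fma_units = 32          # 32-FMA RedMulE instance
macs_per_cycle = 31.6   # measured throughput

# Assumption: one MAC per FMA unit per cycle at peak.
peak_macs_per_cycle = fma_units
utilization = macs_per_cycle / peak_macs_per_cycle

# 31.6 / 32 = 0.9875, which the abstract reports rounded as 98.8%.
print(f"utilization = {utilization:.4f}")
```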

preprint 2022 arXiv

Scale up your In-Memory Accelerator: Leveraging Wireless-on-Chip Communication for AIMC-based CNN Inference

Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for heterogeneous computing, potentially delivering orders of magnitude better peak performance and efficiency over traditional digital signal processing architectures on Matrix-Vector multiplication. However, to sustain this throughput in real-world applications, AIMC tiles must be supplied with data at very high bandwidth and low latency; this puts unprecedented pressure on the on-chip communication infrastructure, which becomes the system's performance and efficiency bottleneck. In this context, the performance and plasticity of emerging on-chip wireless communication paradigms provide the required breakthrough to up-scale on-chip communication in large AIMC devices. This work presents a many-tile AIMC architecture with inter-tile wireless communication that integrates multiple heterogeneous computing clusters, embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an extensive design space exploration of the proposed architecture and discuss the benefits of exploiting emerging on-chip communication technologies such as wireless transceivers in the millimeter-wave and terahertz bands.

preprint 2022 arXiv

SNE: an Energy-Proportional Digital Accelerator for Sparse Event-Based Convolutions

Event-based sensors are drawing increasing attention due to their high temporal resolution, low power consumption, and low bandwidth. To efficiently extract semantically meaningful information from sparse data streams produced by such sensors, we present a 4.5 TOP/s/W digital accelerator capable of performing 4-bit-quantized event-based convolutional neural networks (eCNN). Compared to standard convolutional engines, our accelerator performs a number of operations proportional to the number of events contained in the input data stream, ultimately achieving a high energy-to-information processing proportionality. On the IBM-DVS-Gesture dataset, we report 80 uJ/inf and 261 uJ/inf when the input activity is 1.2% and 4.9%, respectively. Our accelerator consumes 0.221 pJ/SOP, which, to the best of our knowledge, is the lowest energy/OP reported on a digital neuromorphic engine.

preprint 2022 arXiv

TCN Mapping Optimization for Ultra-Low Power Time-Series Edge Inference

Temporal Convolutional Networks (TCNs) are emerging lightweight Deep Learning models for Time Series analysis. We introduce an automated exploration approach and a library of optimized kernels to map TCNs on Parallel Ultra-Low Power (PULP) microcontrollers. Our approach minimizes latency and energy by exploiting a layer tiling optimizer to jointly find the tiling dimensions and select among alternative implementations of the causal and dilated 1D-convolution operations at the core of TCNs. We benchmark our approach on a commercial PULP device, achieving up to 103X lower latency and 20.3X lower energy than the Cube-AI toolkit executed on the STM32L4 and from 2.9X to 26.6X lower energy compared to commercial closed-source and academic open-source approaches on the same hardware target.

preprint 2022 arXiv

Vau da muntanialas: Energy-efficient multi-die scalable acceleration of RNN inference

Recurrent neural networks such as Long Short-Term Memories (LSTMs) learn temporal dependencies by keeping an internal state, making them ideal for time-series problems such as speech recognition. However, the output-to-input feedback creates distinctive memory bandwidth and scalability challenges in designing accelerators for RNNs. We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy efficiency of 3.25 TOP/s/W and performance of 30.53 GOP/s in UMC 65 nm technology. The scalable design of Muntaniala allows running large RNN models by combining multiple tiles in a systolic array. We keep all parameters stationary on every die in the array, drastically reducing the I/O communication to only loading new features and sharing partial results with other dies. For quantifying the overall system power, including I/O power, we built Vau da Muntanialas, to the best of our knowledge, the first demonstration of a systolic multi-chip-on-PCB array of RNN accelerators. Our multi-die prototype performs LSTM inference with 192 hidden states in 330 μs with a total system power of 9.0 mW at 10 MHz, consuming 2.95 μJ. Targeting the 8/16-bit quantization implemented in Muntaniala, we show a phoneme error rate (PER) drop of approximately 3% with respect to floating-point (FP) on a 3L-384NH-123NI LSTM network on the TIMIT dataset.

preprint 2021 arXiv

A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays

In the last few years, research and development on Deep Learning models and techniques for ultra-low-power devices (in a word, TinyML) has mainly focused on a train-then-deploy assumption, with static models that cannot be adapted to newly collected data without cloud-based data collection and fine-tuning. Latent Replay-based Continual Learning (CL) techniques [1] enable online, serverless adaptation in principle, but so far they have still been too computation- and memory-hungry for ultra-low-power TinyML devices, which are typically based on microcontrollers. In this work, we introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor. We rethink the baseline Latent Replay CL algorithm, leveraging quantization of the frozen stage of the model and Latent Replays (LRs) to reduce their memory cost with minimal impact on accuracy. In particular, 8-bit compression of the LR memory proves to be almost lossless (-0.26% with 3000LR) compared to the full-precision baseline implementation, but requires 4x less memory, while 7-bit can also be used with an additional minimal accuracy degradation (up to 5%). We also introduce optimized primitives for forward and backward propagation on the PULP processor. Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory, an amount compatible with embedding in TinyML devices. On an advanced 22nm prototype of our platform, called VEGA, the proposed solution performs on average 65x faster than a low-power STM32 L4 microcontroller, being 37x more energy efficient, enough for a lifetime of 535h when learning a new mini-batch of data once every minute.

preprint 2021 arXiv

Multiscale Anisotropic Harmonic Filters on non Euclidean domains

This paper introduces Multiscale Anisotropic Harmonic Filters (MAHFs) aimed at extracting signal variations over non-Euclidean domains, namely 2D-Manifolds and their discrete representations, such as meshes and 3D Point Clouds, as well as graphs. The topic of pattern analysis is central in image processing and, considering the growing interest in new domains for information representation, the extension of analogous practices to volumetric data is in high demand. To accomplish this purpose, we define MAHFs as the product of two components, respectively related to a suitable smoothing function, namely the heat kernel derived from the heat diffusion equations, and to local directional information. We analyse the effectiveness of our approach in multi-scale filtering and variation extraction. Finally, we present an application to the surface normal field and to a luminance signal textured to a mesh, aiming to spot, separately, relevant curvature changes (support variations) and signal variations.

preprint 2020 arXiv

Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node

Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme where error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2X w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42V (50% of nominal) while keeping more than 99% of the nominal accuracy (with a bit error rate ~1/1000). In this operating point, our prototype performs 4 Gop/s (15.4 Inference/s on the CIFAR-10 dataset) by computing up to 13 binary ops per pJ, achieving 22.8 Inference/s/mW while keeping within a peak power envelope of 674uW - low enough to enable always-on operation in ultra-low power smart cameras, long-lifetime environmental sensors, and insect-sized pico-drones.

preprint 2020 arXiv

Enabling Mixed-Precision Quantized Neural Networks in Extreme-Edge Devices

The deployment of Quantized Neural Networks (QNNs) on advanced microcontrollers requires optimized software to exploit digital signal processing (DSP) extensions of modern instruction set architectures (ISA). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss. The library, composed of 27 kernels, one for each permutation of input feature maps, weights, and output feature maps precision (considering 8-bit, 4-bit and 2-bit), enables efficient inference of QNNs on parallel ultra-low-power (PULP) clusters of RISC-V based processors, featuring the RV32IMCXpulpV2 ISA. The proposed solution, benchmarked on an 8-core GAP-8 PULP cluster, reaches peak performance of 16 MACs/cycle on 8 cores, performing 21x to 25x faster than an STM32H7 (powered by an ARM Cortex M7 processor) with 15x to 21x better energy efficiency.
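
The count of 27 kernels follows from choosing one of three precisions independently for inputs, weights, and outputs (3 x 3 x 3). The short enumeration below makes this concrete; the tuple layout and the label format are hypothetical and do not reflect the library's actual kernel naming.

```python
from itertools import product

PRECISIONS = (8, 4, 2)  # bit widths considered in the abstract

# One configuration per (input, weight, output) precision choice.
kernel_configs = list(product(PRECISIONS, repeat=3))

# Hypothetical labels for illustration only, not the library's identifiers.
labels = [f"in{i}_w{w}_out{o}" for i, w, o in kernel_configs]

print(len(kernel_configs))  # 3 ** 3 combinations
```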

preprint 2020 arXiv

Memory-Latency-Accuracy Trade-offs for Continual Learning on a RISC-V Extreme-Edge Node

AI-powered edge devices currently lack the ability to adapt their embedded inference models to the ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves a peak performance of 2.21 and 1.70 MAC/cycle, respectively, when running forward and backward steps of the gradient descent. We report the trade-off between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL when targeting an image classification task on the CORe50 dataset. Compared with a CL setting that retrains all the layers, which takes 5h to learn a new class and achieves up to 77.3% precision, a more efficient solution retrains only part of the network, reaching an accuracy of 72.5% with a memory requirement of 300 MB and a computation latency of 1.5 hours. On the other hand, retraining only the last layer results in the fastest (867 ms) and least memory-hungry (20 MB) solution, but scores 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25x faster than a typical MCU device, on which CL is still impractical, and demonstrates an 11x gain in terms of energy consumption with respect to mobile-class solutions.

preprint 2020 arXiv

Technical Report: NEMO DNN Quantization for Deployment Model

This technical report aims at defining a formal framework for Deep Neural Network (DNN) layer-wise quantization, focusing in particular on the problems related to the final deployment. It also acts as documentation for the NEMO (NEural Minimization for pytOrch) framework. It describes the four DNN representations used in NEMO (FullPrecision, FakeQuantized, QuantizedDeployable and IntegerDeployable), focusing in particular on a formal definition of the latter two. An important feature of this model, and in particular the IntegerDeployable representation, is that it enables DNN inference using integers only - without resorting to real-valued numbers in any part of the computation and without relying on an explicit fixed-point numerical representation.
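
To make the idea of integer-only inference concrete, here is a generic sketch of a dot product computed entirely on integer images, with real values recovered only at the end by a scale factor. It does not use the NEMO API and is not the report's formal definition; the helper names, the rounding choice, and the scale values (`eps_x`, `eps_w`) are assumptions chosen for illustration.

```python
def quantize(values, eps, lo, hi):
    """Map real values to integers q such that value ~= q * eps,
    clipped to the integer range [lo, hi]."""
    return [min(max(round(v / eps), lo), hi) for v in values]


def integer_dot(q_x, q_w):
    """Dot product on integer images only; no real-valued arithmetic."""
    return sum(a * b for a, b in zip(q_x, q_w))


# Hypothetical scale factors (quanta), chosen as powers of two for clarity.
eps_x, eps_w = 1 / 16, 1 / 32

x = [0.5, 0.25, 0.75]      # activations
w = [0.125, -0.5, 0.25]    # weights

q_x = quantize(x, eps_x, 0, 255)      # unsigned 8-bit range
q_w = quantize(w, eps_w, -128, 127)   # signed 8-bit range

q_y = integer_dot(q_x, q_w)           # stays an integer throughout
eps_y = eps_x * eps_w                 # scale of the integer result
y_from_integers = q_y * eps_y         # real value, needed only for checking
y_reference = sum(a * b for a, b in zip(x, w))
```

Because the inputs here are exact multiples of their scale factors, the integer result reproduces the real-valued dot product exactly; in general the two differ by the quantization error.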