Researcher profile

Guido Masera

Guido Masera contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2022arXiv

CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks

In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most of the scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design.

preprint2022arXiv

Enabling Capsule Networks at the Edge through Approximate Softmax and Squash Operations

Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing for designing approximate variants of the complex operations like softmax and squash. In our experiments, we evaluate tradeoffs between area, power consumption, and critical path delay of the designs implemented with the ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions.

preprint2022arXiv

LaneSNNs: Spiking Neural Networks for Lane Detection on the Loihi Neuromorphic Processor

Autonomous Driving (AD) related features represent important elements for the next generation of mobile robots and autonomous vehicles focused on increasingly intelligent, autonomous, and interconnected systems. The applications involving the use of these features must provide, by definition, real-time decisions, and this property is key to avoid catastrophic accidents. Moreover, all the decision processes must require low power consumption, to increase the lifetime and autonomy of battery-driven systems. These challenges can be addressed through efficient implementations of Spiking Neural Networks (SNNs) on Neuromorphic Chips and the use of event-based cameras instead of traditional frame-based cameras. In this paper, we present a new SNN-based approach, called LaneSNN, for detecting the lanes marked on the streets using the event-based camera input. We develop four novel SNN models characterized by low complexity and fast response, and train them using an offline supervised learning rule. Afterward, we implement and map the learned SNNs models onto the Intel Loihi Neuromorphic Research Chip. For the loss function, we develop a novel method based on the linear composition of Weighted binary Cross Entropy (WCE) and Mean Squared Error (MSE) measures. Our experimental results show a maximum Intersection over Union (IoU) measure of about 0.62 and very low power consumption of about 1 W. The best IoU is achieved with an SNN implementation that occupies only 36 neurocores on the Loihi processor while providing a low latency of less than 8 ms to recognize an image, thereby enabling real-time performance. The IoU measures provided by our networks are comparable with the state-of-the-art, but at a much low power consumption of 1 W.

preprint2020arXiv

FasTrCaps: An Integrated Framework for Fast yet Accurate Training of Capsule Networks

Recently, Capsule Networks (CapsNets) have shown improved performance compared to the traditional Convolutional Neural Networks (CNNs), by encoding and preserving spatial relationships between the detected features in a better way. This is achieved through the so-called Capsules (i.e., groups of neurons) that encode both the instantiation probability and the spatial information. However, one of the major hurdles in the wide adoption of CapsNets is their gigantic training time, which is primarily due to the relatively higher complexity of their new constituting elements that are different from CNNs. In this paper, we implement different optimizations in the training loop of the CapsNets, and investigate how these optimizations affect their training speed and the accuracy. Towards this, we propose a novel framework FasTrCaps that integrates multiple lightweight optimizations and a novel learning rate policy called WarmAdaBatch (that jointly performs warm restarts and adaptive batch size), and steers them in an appropriate way to provide high training-loop speedup at minimal accuracy loss. We also propose weight sharing for capsule layers. The goal is to reduce the hardware requirements of CapsNets by removing unused/redundant connections and capsules, while keeping high accuracy through tests of different learning rate policies and batch sizes. We demonstrate that one of the solutions generated by the FasTrCaps framework can achieve 58.6% reduction in the training time, while preserving the accuracy (even 0.12% accuracy improvement for the MNIST dataset), compared to the CapsNet by Google Brain. The Pareto-optimal solutions generated by FasTrCaps can be leveraged to realize trade-offs between training time and achieved accuracy. We have open-sourced our framework on https://github.com/Alexei95/FasTrCaps.

preprint2020arXiv

Q-CapsNets: A Specialized Framework for Quantizing Capsule Networks

Capsule Networks (CapsNets), recently proposed by the Google Brain team, have superior learning capabilities in machine learning tasks, like image classification, compared to the traditional CNNs. However, CapsNets require extremely intense computations and are difficult to be deployed in their original form at the resource-constrained edge devices. This paper makes the first attempt to quantize CapsNet models, to enable their efficient edge implementations, by developing a specialized quantization framework for CapsNets. We evaluate our framework for several benchmarks. On a deep CapsNet model for the CIFAR10 dataset, the framework reduces the memory footprint by 6.2x, with only 0.15% accuracy loss. We will open-source our framework at https://git.io/JvDIF in August 2020.

preprint2013arXiv

A joint communication and application simulator for NoC-based SoCs

NoCs have become a widespread paradigm in the system-on-chip design world, not only for multi-purpose SoCs, but also for application-specific ICs. The common approach in the NoC design world is to separate the design of the interconnection from the design of the processing elements: this is well suited for a large number of developments, but the need for joint application and NoC design is not uncommon, especially in the application specific case. The correlation between processing and communication tasks can be strong, and separate or trace-based simulations fall often short of the desired precision. In this work, the OMNET++ based JANoCS simulator is presented: concurrent simulation of processing and communication allow cycle-accurate evaluation of the system. Two cases of study are presented, showing both the need for joint simulations and the effectiveness of JANoCS.

preprint2011arXiv

A Flexible LDPC code decoder with a Network on Chip as underlying interconnect architecture

LDPC (Low Density Parity Check) codes are among the most powerful and widely adopted modern error correcting codes. The iterative decoding algorithms required for these codes involve high computational complexity and high processing throughput is achieved by allocating a sufficient number of processing elements (PEs). Supporting multiple heterogeneous LDPC codes on a parallel decoder poses serious problems in the design of the interconnect structure for such PEs. The aim of this work is to explore the feasibility of NoC (Network on Chip) based decoders, where full flexibility in terms of supported LDPC codes is obtained resorting to an NoC to connect PEs. NoC based LDPC decoders have been previously considered unfeasible because of the cost overhead associated to packet management and routing. On the contrary, the designed NoC adopts a low complexity routing, which introduces a very limited cost overhead with respect to architectures dedicated to specific classes of codes. Moreover the paper proposes an efficient configuration technique, which allows for fast on--the--fly switching among different codes. The decoder architecture is scalable and VLSI synthesis results are presented for several cases of study, including the whole set of WiMAX LDPC codes, WiFi codes and DVB-S2 standard.

preprint2011arXiv

Improving Network-on-Chip-based turbo decoder architectures

In this work novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth reduction techniques. This technique shows in the best case an improvement of more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require higher area than binary ones. This characteristic has the negative effect of increasing the data width of the network nodes. Thus, the second contribution of this work is to reduce the network complexity to support doublebinary codes, by exploiting bit-level and pseudo-floating-point representation of the extrinsic information. These two techniques allow for an area reduction of up to more than the 40% with a performance degradation of about 0.2 dB.

preprint2010arXiv

A Novel VLSI Architecture of Fixed-complexity Sphere Decoder

Fixed-complexity Sphere Decoder (FSD) is a recently proposed technique for Multiple-Input Multiple-Output (MIMO) detection. It has several outstanding features such as constant throughput and large potential parallelism, which makes it suitable for efficient VLSI implementation. However, to our best knowledge, no VLSI implementation of FSD has been reported in the literature, although some FPGA prototypes of FSD with pipeline architecture have been developed. These solutions achieve very high throughput but at very high cost of hardware resources, making them impractical in real applications. In this paper, we present a novel four-nodes-per-cycle parallel architecture of FSD, with a breadth-first processing that allows for short critical path. The implementation achieves a throughput of 213.3 Mbps at 400 MHz clock frequency, at a cost of 0.18 mm2 Silicon area on 0.13μm CMOS technology. The proposed solution is much more economical compared with the existing FPGA implementations, and very suitable for practicl applications because of its balanced performance and hardware-complexity; moreover it has the flexibility to be expanded into an eight-nodes-per-cycle version in order to double the throughput.