Source author record

Andrea Bartolini

Andrea Bartolini appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning Hardware Architecture Artificial Intelligence Cryptography and Security eess.SP Performance

Catalog footprint

What is connected

9works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2024arXiv

TitanCFI: Toward Enforcing Control-Flow Integrity in the Root-of-Trust

Modern RISC-V platforms control and monitor security-critical systems such as industrial controllers and autonomous vehicles. While these platforms feature a Root-of-Trust (RoT) to store authentication secrets and enable secure boot technologies, they often lack Control-Flow Integrity (CFI) enforcement and are vulnerable to cyber-attacks which divert the control flow of an application to trigger malicious behaviours. Recent techniques to enforce CFI in RISC-V systems include ISA modifications or custom hardware IPs, all requiring ad-hoc binary toolchains or design of CFI primitives in hardware. This paper proposes TitanCFI, a novel approach to enforce CFI in the RoT. TitanCFI modifies the commit stage of the protected core to stream control flow instructions to the RoT and it integrates the CFI enforcement policy in the RoT firmware. Our approach enables maximum reuse of the hardware resource present in the System-on-Chip (SoC), and it avoids the design of custom IPs and the modification of the compilation toolchain, while exploiting the RoT tamper-proof storage and cryptographic accelerators to secure CFI metadata. We implemented the proposed architecture on a modern RISC-V SoC along with a return address protection policy in the RoT, and benchmarked area and runtime overhead. Experimental results show that TitanCFI achieves overhead comparable to SoA hardware CFI solutions for most benchmarks, with lower area overhead, resulting in 1% of additional area occupation.

preprint2022arXiv

Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers

The new open and royalty-free RISC-V ISA is attracting interest across the whole computing continuum, from microcontrollers to supercomputers. High-performance RISC-V processors and accelerators have been announced, but RISC-V-based HPC systems will need a holistic co-design effort, spanning memory, storage hierarchy interconnects and full software stack. In this paper, we describe Monte Cimone, a fully-operational multi-blade computer prototype and hardware-software test-bed based on U740, a double-precision capable multi-core, 64-bit RISC-V SoC. Monte Cimone does not aim to achieve strong floating-point performance, but it was built with the purpose of "priming the pipe" and exploring the challenges of integrating a multi-node RISC-V cluster capable of providing an HPC production stack including interconnect, storage and power monitoring infrastructure on RISC-V hardware. We present the results of our hardware/software integration effort, which demonstrate a remarkable level of software and hardware readiness and maturity - showing that the first generation of RISC-V HPC machines may not be so far in the future.

preprint2022arXiv

RUAD: unsupervised anomaly detection in HPC systems

The increasing complexity of modern high-performance computing (HPC) systems necessitates the introduction of automated and data-driven methodologies to support system administrators' effort toward increasing the system's availability. Anomaly detection is an integral part of improving the availability as it eases the system administrator's burden and reduces the time between an anomaly and its resolution. However, current state-of-the-art (SoA) approaches to anomaly detection are supervised and semi-supervised, so they require a human-labelled dataset with anomalies - this is often impractical to collect in production HPC systems. Unsupervised anomaly detection approaches based on clustering, aimed at alleviating the need for accurate anomaly data, have so far shown poor performance. In this work, we overcome these limitations by proposing RUAD, a novel Recurrent Unsupervised Anomaly Detection model. RUAD achieves better results than the current semi-supervised and unsupervised SoA approaches. This is achieved by considering temporal dependencies in the data and including long-short term memory cells in the model architecture. The proposed approach is assessed on a complete ten-month history of a Tier-0 system (Marconi100 from CINECA with 980 nodes). RUAD achieves an area under the curve (AUC) of 0.763 in semi-supervised training and an AUC of 0.767 in unsupervised training, which improves upon the SoA approach that achieves an AUC of 0.747 in semi-supervised training and an AUC of 0.734 in unsupervised training. It also vastly outperforms the current SoA unsupervised anomaly detection approach based on clustering, achieving the AUC of 0.548.

preprint2020arXiv

A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

preprint2020arXiv

pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of "big data" streaming support they often require for data analysis, is nowadays pushing for an increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge, are more and more investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis like anomaly detection and security, but introducing new challenges for handling the vast amount of data it produces. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs, that involves AI-powered edge computing on high-resolution power consumption. The method -- called pAElla -- targets real-time Malware Detection (MD), it runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with State-of-the-Art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release open dataset and code.

preprint2020arXiv

Toward an End-to-End Auto-tuning Framework in HPC PowerStack

Efficiently utilizing procured power and optimizing performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack. This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide a background on the PowerStack layer-specific tuning efforts in terms of their high-level objectives, the constraints and optimization goals, layer-specific telemetry, and control parameters, and we list the existing software solutions that address those challenges. Second, we propose the PowerStack end-to-end auto-tuning framework, identify the opportunities in co-tuning different layers in the PowerStack, and present specific use cases and solutions. Third, we discuss the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first steps in identifying and aggregating the important R&D challenges in streamlining the optimization efforts across the layers of the PowerStack.

preprint2019arXiv

Online Anomaly Detection in HPC Systems

Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrator and final users have to discover it manually. Clearly this approach does not scale to large scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions is needed. Our method uses a type of neural network called autoncoder trained to learn the normal behavior of a real, in-production HPC system and it is deployed on the edge of each computing node. We obtain a very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the computing units performance.

preprint2018arXiv

Anomaly Detection using Autoencoders in High Performance Computing Systems

Anomaly detection in supercomputers is a very difficult problem due to the big scale of the systems and the high number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unhealthy states). We propose a novel approach for anomaly detection in High Performance Computing systems based on a Machine (Deep) Learning technique, namely a type of neural network called autoencoder. The key idea is to train a set of autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes and, after training, use them to identify abnormal conditions. This is different from previous approaches which where based on learning the abnormal condition, for which there are much smaller datasets (since it is very hard to identify them to begin with). We test our approach on a real supercomputer equipped with a fine-grained, scalable monitoring infrastructure that can provide large amount of data to characterize the system behaviour. The results are extremely promising: after the training phase to learn the normal system behaviour, our method is capable of detecting anomalies that have never been seen before with a very good accuracy (values ranging between 88% and 96%).

preprint2018arXiv

Pricing Schemes for Energy-Efficient HPC Systems: Design and Exploration

Energy efficiency is of paramount importance for the sustainability of HPC systems. Energy consumption limits the peak performance of supercomputers and accounts for a large share of total cost of ownership. Consequently, system owners and final users have started exploring mechanisms to trade off performance for power consumption, for example through frequency and voltage scaling. However, only a limited number of studies have been devoted to explore the economic viability of performance scaling solutions and to devise pricing mechanisms fostering a more energy-conscious usage of resources, without adversely impacting return-of-investment on the HPC facility. We present a parametrized model to analyze the impact of frequency scaling on energy and to assess the potential total cost benefits for the HPC facility and the user. We evaluate four pricing schemes, considering both facility manager and the user perspectives. We then perform a design space exploration considering current and near-future HPC systems and technologies.

Andrea Bartolini

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

TitanCFI: Toward Enforcing Control-Flow Integrity in the Root-of-Trust

Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers

RUAD: unsupervised anomaly detection in HPC systems

A Machine Learning Approach to Online Fault Classification in HPC Systems

pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Toward an End-to-End Auto-tuning Framework in HPC PowerStack

Online Anomaly Detection in HPC Systems

Anomaly Detection using Autoencoders in High Performance Computing Systems

Pricing Schemes for Energy-Efficient HPC Systems: Design and Exploration