Source author record

Diego Didona

Diego Didona appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Performance Distributed, Parallel, and Cluster Computing Databases Machine Learning

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Lynceus: Cost-efficient Tuning and Provisioning of Data Analytic Jobs

Modern data analytic and machine learning jobs find in the cloud a natural deployment platform to satisfy their notoriously large resource requirements. Yet, to achieve cost efficiency, it is crucial to identify a deployment configuration that satisfies user-defined QoS constraints (e.g., on execution time), while avoiding unnecessary over-provisioning. This paper introduces Lynceus, a new approach for the optimization of cloud based data analytic jobs that improves overstate-of-the-art approaches by enabling significant cost savings both in terms of the final recommended configuration and of the optimization process used to recommend configurations. Unlike existing solutions, Lynceus optimizes in a joint fashion both the cloud-related and the application-level parameters. This allows for a reduction of the cost of recommended configurations by up to 3.7x at the 90-th percentile with respect to existing approaches, which treat the optimization of cloud-related and application-level parameters as two independent problems. Further, Lynceus reduces the cost of the optimization process (i.e., the cloud cost incurred for testing configurations) by up to 11x. Such an improvement is achieved thanks to two mechanisms: i) a timeout approach which allows to abort the exploration of configurations that are deemed suboptimal, while still extracting useful information to guide future explorations and to improve its predictive model - differently from recent works, which either incur the full cost for testing suboptimal configurations or are unable to extract any knowledge from aborted runs; ii) a long-sighted and budget-aware technique that determines which configurations to test by predicting the long-term impact of each exploration - unlike state-of-the-art approaches for the optimization of cloud jobs, which adopt greedy optimization methods.

preprint2020arXiv

Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs

Solid-state drives (SSDs) are extensively used to deploy persistent data stores, as they provide low latency random access, high write throughput, high data density, and low cost. Tree-based data structures are widely used to build persistent data stores, and indeed they lie at the backbone of many of the data management systems used in production and research today. In this paper, we show that benchmarking a persistent tree-based data structure on an SSD is a complex process, which may easily incur subtle pitfalls that can lead to an inaccurate performance assessment. At a high-level, these pitfalls stem from the interaction of complex software running on complex hardware. On one hand, tree structures implement internal operations that have nontrivial effects on performance. On the other hand, SSDs employ firmware logic to deal with the idiosyncrasies of the underlying flash memory, which are well known to lead to complex performance dynamics. We identify seven benchmarking pitfalls using RocksDB and WiredTiger, two widespread implementations of an LSM-Tree and a B+Tree, respectively. We show that such pitfalls can lead to incorrect measurements of key performance indicators, hinder the reproducibility and the representativeness of the results, and lead to suboptimal deployments in production environments. We also provide guidelines on how to avoid these pitfalls to obtain more reliable performance measurements, and to perform more thorough and fair comparison among different design points.

preprint2014arXiv

A Flexible Framework for Accurate Simulation of Cloud In-Memory Data Stores

In-memory (transactional) data stores are recognized as a first-class data management technology for cloud platforms, thanks to their ability to match the elasticity requirements imposed by the pay-as-you-go cost model. On the other hand, defining the well-suited amount of cache servers to be deployed, and the degree of in-memory replication of slices of data, in order to optimize reliability/availability and performance tradeoffs, is far from being a trivial task. Yet, it is an essential aspect of the provisioning process of cloud platforms, given that it has an impact on how well cloud resources are actually exploited. To cope with the issue of determining optimized configurations of cloud in-memory data stores, in this article we present a flexible simulation framework offering skeleton simulation models that can be easily specialized in order to capture the dynamics of diverse data grid systems, such as those related to the specific protocol used to provide data consistency and/or transactional guarantees. Besides its flexibility, another peculiar aspect of the framework lies in that it integrates simulation and machine-learning (black-box) techniques, the latter being essentially used to capture the dynamics of the data-exchange layer (e.g. the message passing layer) across the cache servers. This is a relevant aspect when considering that the actual data-transport/networking infrastructure on top of which the data grid is deployed might be unknown, hence being not feasible to be modeled via white-box (namely purely simulative) approaches. We also provide an extended experimental study aimed at validating instances of simulation models supported by our framework against execution dynamics of real data grid systems deployed on top of either private or public cloud infrastructures.

preprint2014arXiv

On Bootstrapping Machine Learning Performance Predictors via Analytical Models

Performance modeling typically relies on two antithetic methodologies: white box models, which exploit knowledge on system's internals and capture its dynamics using analytical approaches, and black box techniques, which infer relations among the input and output variables of a system based on the evidences gathered during an initial training phase. In this paper we investigate a technique, which we name Bootstrapping, which aims at reconciling these two methodologies and at compensating the cons of the one with the pros of the other. We thoroughly analyze the design space of this gray box modeling technique, and identify a number of algorithmic and parametric trade-offs which we evaluate via two realistic case studies, a Key-Value Store and a Total Order Broadcast service.

Diego Didona

What is connected

Connect this record

See the researcher in context

Building this map preview

4 published item(s)

Lynceus: Cost-efficient Tuning and Provisioning of Data Analytic Jobs

Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs

A Flexible Framework for Accurate Simulation of Cloud In-Memory Data Stores

On Bootstrapping Machine Learning Performance Predictors via Analytical Models