Source author record

Ivo Jimenez

Ivo Jimenez appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Distributed, Parallel, and Cluster Computing; Databases

Catalog footprint

3 works, 2 topics, 4 close collaborators


Preprint, 2022, arXiv

Skyhook: Towards an Arrow-Native Storage System

With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, trying to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches re-implemented functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for different storage systems. This paper introduces a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with no modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging distributed storage systems' scale-out and availability properties. We present Skyhook, an example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of Skyhook and discuss key results.
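The offloading idea in this abstract can be illustrated with a small, self-contained sketch. It is not Skyhook's code and does not use Ceph or Apache Arrow: `StorageNode`, `read_all`, `scan`, and the toy rows are hypothetical names and data invented for illustration. The sketch only compares how many cells cross the "network" when filtering and projection run on the client versus on the storage nodes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class StorageNode:
    """Hypothetical storage node holding one shard of rows (illustration only)."""
    rows: List[Row]

    def read_all(self) -> List[Row]:
        # Plain read: every row and every column is shipped to the client.
        return list(self.rows)

    def scan(self, columns: List[str], predicate: Callable[[Row], bool]) -> List[Row]:
        # Offloaded scan: filter and project on the node, ship only the result.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]


def cells(rows: List[Row]) -> int:
    """Count transferred cells as a crude stand-in for bytes on the wire."""
    return sum(len(r) for r in rows)


# Three shards of 100 rows each; "pad" stands in for columns the query ignores.
nodes = [
    StorageNode([{"id": i, "value": i % 10, "pad": 0} for i in range(start, start + 100)])
    for start in (0, 100, 200)
]


def wanted(row: Row) -> bool:
    return row["value"] == 7


# Client-side processing: pull everything, then filter and project locally.
client_rows = [r for n in nodes for r in n.read_all()]
client_result = [{"id": r["id"]} for r in client_rows if wanted(r)]
client_cells = cells(client_rows)

# Storage-side processing: each node filters and projects its own shard.
offloaded_batches = [n.scan(["id"], wanted) for n in nodes]
offloaded_result = [r for batch in offloaded_batches for r in batch]
offloaded_cells = sum(cells(batch) for batch in offloaded_batches)

print(client_cells, offloaded_cells)
```

Both paths return the same rows, but in this toy setup the offloaded path transfers far fewer cells and spreads the filtering work across the nodes instead of concentrating it on one client.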

Preprint, 2014, arXiv

Distributed Versioned Object Storage -- Alternatives at the OSD layer (Poster Extended Abstract)

The ability to store multiple versions of a data item is a powerful primitive that has had a wide variety of uses: relational databases, transactional memory, and version control systems, to name a few. However, each implementation uses a very particular form of versioning that is customized to the domain in question and hidden away from the user. In our ongoing project, we are reviewing and analyzing multiple uses of versioning in distinct domains, with the goal of identifying the basic components required to provide a generic distributed multiversioning object storage service, and defining how these can be customized in order to serve distinct needs. With this primitive, new services can leverage multiversioning to ease development and provide specific consistency guarantees that address particular use cases. This work presents early results that quantify the trade-offs in implementing versioning at the local storage layer.

Preprint, 2013, arXiv

RITA: An Index-Tuning Advisor for Replicated Databases

Given a replicated database, a divergent design tunes the indexes in each replica differently in order to specialize it for a specific subset of the workload. This specialization brings significant performance gains compared to the common practice of having the same indexes in all replicas, but requires the development of new tuning tools for database administrators. In this paper we introduce RITA (Replication-aware Index Tuning Advisor), a novel divergent-tuning advisor that offers several essential features not found in existing tools: it generates robust divergent designs that allow the system to adapt gracefully to replica failures; it computes designs that spread the load evenly among specialized replicas, both during normal operation and when replicas fail; it monitors the workload online in order to detect changes that require a recomputation of the divergent design; and it offers suggestions to elastically reconfigure the system (by adding/removing replicas or adding/dropping indexes) to respond to workload changes. The key technical innovation behind RITA is showing that the problem of selecting an optimal design can be formulated as a Binary Integer Program (BIP). The BIP has a relatively small number of variables, which makes it feasible to solve efficiently using any off-the-shelf linear-optimization software. Experimental results demonstrate that RITA computes better divergent designs compared to existing tools, offers more features, and has fast execution times.
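A toy example may help make the notion of a divergent design concrete. The sketch below is not RITA's BIP formulation, which the abstract does not spell out: the index names, query costs, replica count, and per-replica budget are all hypothetical, and the search is a brute-force enumeration over binary index-placement variables rather than a call to a solver. It also ignores load balancing and replica failures, which the abstract says RITA accounts for.

```python
from itertools import product

# Hypothetical candidate indexes and per-query costs (illustration only).
INDEXES = ["idx_a", "idx_b", "idx_c"]
QUERIES = {
    "q1": {"base": 100, "idx_a": 10},
    "q2": {"base": 100, "idx_b": 10},
    "q3": {"base": 100, "idx_c": 10},
    "q4": {"base": 100, "idx_a": 20, "idx_c": 15},
}
REPLICAS = 2  # assumed number of replicas
BUDGET = 2    # assumed maximum number of indexes per replica


def query_cost(query, design):
    """Cost of one query on a replica; design holds one 0/1 variable per index."""
    costs = [QUERIES[query]["base"]]
    for index, present in zip(INDEXES, design):
        if present and index in QUERIES[query]:
            costs.append(QUERIES[query][index])
    return min(costs)


def total_cost(designs):
    """Route every query to its cheapest replica and sum the costs."""
    return sum(min(query_cost(q, d) for d in designs) for q in QUERIES)


def feasible_designs():
    """All 0/1 index assignments for one replica that respect the budget."""
    return [d for d in product((0, 1), repeat=len(INDEXES)) if sum(d) <= BUDGET]


def best_uniform():
    """Best design when every replica carries the same indexes."""
    return min(((d,) * REPLICAS for d in feasible_designs()), key=total_cost)


def best_divergent():
    """Best design when each replica may carry different indexes."""
    return min(product(feasible_designs(), repeat=REPLICAS), key=total_cost)


print("uniform cost:", total_cost(best_uniform()))
print("divergent cost:", total_cost(best_divergent()))
```

With these made-up numbers, the divergent design is cheaper because the two replicas together cover all three indexes, while a uniform design is limited to the same two indexes everywhere. Exhaustive enumeration only works at this tiny scale, which is why a compact integer-programming formulation matters in practice.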