Source author record

Alexandru Uta

Alexandru Uta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Distributed, Parallel, and Cluster Computing; cs.CY; Performance

Catalog footprint

What is connected

4 works

3 topics

4 close collaborators

Preprint, 2022, arXiv

Future Computer Systems and Networking Research in the Netherlands: A Manifesto

Our modern society and competitive economy depend on a strong digital foundation and, in turn, on sustained research and innovation in computer systems and networks (CompSys). With this manifesto, we draw attention to CompSys as a vital part of ICT. Among ICT technologies, CompSys covers all the hardware and all the operational software layers that enable applications; only application-specific details, and often only application-specific algorithms, are not part of CompSys. Each of the Top Sectors of the Dutch Economy, each route in the National Research Agenda, and each of the UN Sustainable Development Goals pose challenges that cannot be addressed without groundbreaking CompSys advances. Looking at the 2030-2035 horizon, important new applications will emerge only when enabled by CompSys developments. Triggered by the COVID-19 pandemic, millions moved abruptly online, raising infrastructure scalability and data sovereignty issues; but governments processing social data and responsible social networks still require a paradigm shift in data sovereignty and sharing. AI already requires massive computer systems which can cost millions per training task, but the current technology leaves an unsustainable energy footprint including large carbon emissions. Computational sciences such as bioinformatics, and "Humanities for all" and "citizen data science", cannot become affordable and efficient until computer systems take a generational leap. Similarly, the emerging quantum internet depends on (traditional) CompSys to bootstrap operation for the foreseeable future. Large commercial sectors, including finance and manufacturing, require specialized computing and networking or risk becoming uncompetitive. And, at the core of Dutch innovation, promising technology hubs, deltas, ports, and smart cities, could see their promise stagger due to critical dependency on non-European technology.

Preprint, 2022, arXiv

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.
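The combination of indexed lookups, index-backed joins, and versioned appends described in this abstract can be illustrated with a small sketch. This is a toy, single-process illustration under stated assumptions, not the Indexed DataFrame library or any Spark API: the class `IndexedTable` and its methods are hypothetical names, and "version" here is simply the number of rows committed so far, which is a much simpler scheme than full multi-version concurrency control.

```python
from collections import defaultdict


class IndexedTable:
    """Hypothetical append-only table with a hash index on one column."""

    def __init__(self, key):
        self.key = key
        self.rows = []                  # append-only row storage
        self.index = defaultdict(list)  # key value -> row positions, ascending
        self.version = 0                # number of rows committed so far

    def append(self, batch):
        """Append rows, update the index, and return the new version."""
        for row in batch:
            self.index[row[self.key]].append(len(self.rows))
            self.rows.append(row)
        self.version = len(self.rows)
        return self.version

    def lookup(self, value, version=None):
        """Return rows matching `value` as seen at `version` (default: latest)."""
        limit = self.version if version is None else version
        return [self.rows[i] for i in self.index.get(value, []) if i < limit]

    def join(self, probe_rows, probe_key, version=None):
        """Index-backed join: probe the index once per input row, no full scan."""
        return [
            {**left, **right}
            for left in probe_rows
            for right in self.lookup(left[probe_key], version)
        ]


table = IndexedTable("id")
v1 = table.append([{"id": 1, "x": "a"}, {"id": 2, "x": "b"}])
v2 = table.append([{"id": 1, "x": "c"}])

latest = table.lookup(1)                # sees both rows with id == 1
snapshot = table.lookup(1, version=v1)  # sees only the row appended before v1
joined = table.join([{"uid": 1, "n": "p"}], "uid")
```

Because rows are never modified in place, a reader holding an older version number keeps a stable view while new batches are appended, which is the property the versioned reads in this sketch are meant to show.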

Preprint, 2022, arXiv

Skyhook: Towards an Arrow-Native Storage System

With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck trying to keep up feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches re-implemented functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for different storage systems. This paper introduces a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with no modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging distributed storage systems' scale-out and availability properties. We present Skyhook, an example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of Skyhook and discuss key results.
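The idea of offloading filtering and projection to the storage layer can be sketched with a toy in-process model. Everything here is an assumption made for illustration: `ToyObjectStore`, `read_all`, and `scan` are hypothetical names and do not correspond to the Skyhook, Ceph, or Apache Arrow APIs. The sketch only contrasts how many rows cross the storage boundary when the client filters versus when the store does.

```python
class ToyObjectStore:
    """Hypothetical storage node that counts rows sent to the client."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0

    def read_all(self):
        """Baseline: ship every row; the client filters and projects."""
        self.rows_sent += len(self.rows)
        return list(self.rows)

    def scan(self, predicate, columns):
        """Offloaded: filter and project inside the store, ship only results."""
        result = [
            {c: row[c] for c in columns}
            for row in self.rows
            if predicate(row)
        ]
        self.rows_sent += len(result)
        return result


data = [{"id": i, "temp": i * 10, "site": "a"} for i in range(10)]

# Client-side filtering: all 10 rows are transferred before filtering.
baseline_store = ToyObjectStore(data)
client_side = [
    {"id": row["id"]} for row in baseline_store.read_all() if row["temp"] >= 70
]

# Storage-side filtering: only the 3 matching, projected rows are transferred.
offload_store = ToyObjectStore(data)
pushed_down = offload_store.scan(lambda row: row["temp"] >= 70, ["id"])
```

Both paths return the same result; the difference is where the CPU work happens and how much data leaves the storage node, which is the trade-off the abstract describes.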

Preprint, 2020, arXiv

In Datacenter Performance, The Only Constant Is Change

All computing infrastructure suffers from performance variability, be it bare-metal or virtualized. This phenomenon originates from many sources: some transient, such as noisy neighbors, and others more permanent but sudden, such as changes or wear in hardware, changes in the underlying hypervisor stack, or even undocumented interactions between the policies of the computing resource provider and the active workloads. Thus, performance measurements obtained on clouds, HPC facilities, and, more generally, datacenter environments are almost guaranteed to exhibit performance regimes that evolve over time, which leads to undesirable nonstationarities in application performance. In this paper, we present our analysis of performance of the bare-metal hardware available on the CloudLab testbed where we focus on quantifying the evolving performance regimes using changepoint detection. We describe our findings, backed by a dataset with nearly 6.9M benchmark results collected from over 1600 machines over a period of 2 years and 9 months. These findings yield a comprehensive characterization of real-world performance variability patterns in one computing facility, a methodology for studying such patterns on other infrastructures, and contribute to a better understanding of performance variability in general.
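To make the notion of changepoint detection concrete, the following is a minimal sketch that finds a single shift in the mean of a benchmark series by choosing the split that minimizes the total squared error of the two segments. It is an assumption-laden illustration, not the method or tooling used in the paper: the function `best_mean_shift`, its `min_size` parameter, and the sample runtimes are all hypothetical.

```python
def best_mean_shift(values, min_size=2):
    """Return (split_index, gain) for the best single mean shift.

    `gain` is the reduction in sum of squared errors obtained by fitting
    two means (before and after the split) instead of one. Returns
    (None, 0.0) when no split improves on a single mean.
    """
    n = len(values)
    if n < 2 * min_size:
        raise ValueError("series too short for the requested min_size")

    def sse(total, total_sq, count):
        # Sum of squared deviations from the segment mean.
        return total_sq - total * total / count

    total = sum(values)
    total_sq = sum(v * v for v in values)
    base = sse(total, total_sq, n)

    best_index, best_gain = None, 0.0
    left_sum = left_sq = 0.0
    for i in range(1, n):
        v = values[i - 1]
        left_sum += v
        left_sq += v * v
        if i < min_size or n - i < min_size:
            continue
        cost = sse(left_sum, left_sq, i) + sse(
            total - left_sum, total_sq - left_sq, n - i
        )
        gain = base - cost
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


# Hypothetical benchmark runtimes with a regime change after the fifth run.
runtimes = [10.0] * 5 + [12.0] * 5
split, gain = best_mean_shift(runtimes)
```

A real analysis would need to handle multiple changepoints, noise, and a significance threshold for the gain; this sketch only shows the core idea of locating where one performance regime ends and another begins.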
