Source author record

Herodotos Herodotou

Herodotos Herodotou appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Distributed, Parallel, and Cluster Computing; Databases

Catalog footprint

What is connected

5 works

2 topics

4 close collaborators

Preprint, 2020, arXiv

Hihooi: A Database Replication Middleware for Scaling Transactional Databases Consistently

With the advent of the Internet and Internet-connected devices, modern business applications can experience rapid increases as well as variability in transactional workloads. Database replication has been employed to scale performance and improve availability of relational databases but past approaches have suffered from various issues including limited scalability, performance versus consistency tradeoffs, and requirements for database or application modifications. This paper presents Hihooi, a replication-based middleware system that is able to achieve workload scalability, strong consistency guarantees, and elasticity for existing transactional databases at a low cost. A novel replication algorithm enables Hihooi to propagate database modifications asynchronously to all replicas at high speeds, while ensuring that all replicas are consistent. At the same time, a fine-grained routing algorithm is used to load balance incoming transactions to available replicas in a consistent way. Our thorough experimental evaluation with several well-established benchmarks shows how Hihooi is able to achieve almost linear workload scalability for transactional databases.

Preprint, 2020, arXiv

S2CE: A Hybrid Cloud and Edge Orchestrator for Mining Exascale Distributed Streams

The explosive increase in volume, velocity, variety, and veracity of data generated by distributed and heterogeneous nodes, such as IoT and other devices, continuously challenges the state of the art in big data processing platforms and mining techniques. Consequently, it reveals an urgent need to address the ever-growing gap between this expected exascale data generation and the extraction of insights from these data. To address this need, this paper proposes Stream to Cloud & Edge (S2CE), a first-of-its-kind, optimized multi-cloud and edge orchestrator that is easily configurable, scalable, and extensible. S2CE will enable machine and deep learning over voluminous and heterogeneous data streams running on hybrid cloud and edge settings, while offering the necessary functionalities for practical and scalable processing: data fusion and preprocessing, sampling and synthetic stream generation, cloud and edge smart resource management, and distributed processing.

Preprint, 2019, arXiv

Automating Distributed Tiered Storage Management in Cluster Computing

Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them while trying to optimize their workloads. In this paper, we develop a general framework for automatically moving data across the available storage tiers in distributed file systems. Moreover, we employ machine learning for tracking and predicting file access patterns, which we use to decide when and which data to move up or down the storage tiers for increasing system performance. Our approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.

Preprint, 2012, arXiv

Stubby: A Transformation-based Optimizer for MapReduce Workflows

There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces--ranging from program-based to query-based interfaces--for generating MapReduce workflows. Studies have shown that the gap in performance can be quite large between optimized and unoptimized workflows. However, automatic cost-based optimization of MapReduce workflows remains a challenge due to the multitude of interfaces, large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, which is a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.

Preprint, 2011, arXiv

Hadoop Performance Models

Hadoop MapReduce is now a popular choice for performing large-scale data analytics. This technical report describes a detailed set of mathematical performance models for describing the execution of a MapReduce job on Hadoop. The models describe dataflow and cost information at the fine granularity of phases within the map and reduce tasks of a job execution. The models can be used to estimate the performance of MapReduce jobs as well as to find the optimal configuration settings to use when running the jobs.