Source author record

Renato L. F. Cunha

Renato L. F. Cunha appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

4 works

1 topic

4 close collaborators

Preprint, 2016, arXiv

Helping HPC Users Specify Job Memory Requirements via Machine Learning

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulty estimating the number of processors and amount of memory required by their jobs, selecting the queue and partition, and estimating when job output will be available so they can plan their next experiments. Apart from wasting infrastructure resources, wrong allocation decisions can also negatively impact overall user response time. Techniques that exploit batch scheduler systems to predict waiting time and runtime of user jobs have already been proposed. However, we observed that such techniques are not suitable for predicting job memory usage. In this paper we introduce a tool to help users predict their memory requirements using machine learning. We describe the integration of the tool with a batch scheduler system, discuss how batch scheduler log data can be exploited to generate memory usage predictions through machine learning, and present results from two production systems containing thousands of jobs.

Preprint, 2016, arXiv

Job Placement Advisor Based on Turnaround Predictions for HPC Hybrid Clouds

Several companies and research institutes are moving their CPU-intensive applications to hybrid High Performance Computing (HPC) cloud environments. Such a shift depends on the creation of software systems that help users decide where a job should be placed, considering execution time and the queue wait time to access on-premise clusters. Relying blindly on turnaround prediction techniques will negatively affect response times inside HPC cloud environments. This paper introduces a tool to make job placement decisions in HPC hybrid cloud environments, taking into account the inaccuracy of execution and waiting time predictions. We used job traces from real supercomputing centers to run our experiments, and compared the performance between environments using real speedup curves. We also extended a state-of-the-art machine-learning-based predictor to work with data from the cluster scheduler. Our main findings are: (i) depending on workload characteristics, there is a turning point where predictions should be disregarded in favor of a more conservative decision to minimize job turnaround times and (ii) scheduler data plays a key role in improving predictions generated with machine learning using job trace data: our experiments showed an improvement of around 20% in prediction accuracy.
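The turning-point idea in finding (i) can be illustrated with a minimal sketch. This is not the tool described in the paper: the function name `place_job`, the `error_cutoff` threshold, and the choice of the cloud as the "conservative" fallback are all assumptions made for illustration.

```python
def place_job(predicted_wait, onprem_runtime, cloud_runtime,
              wait_error, error_cutoff=0.5, conservative="cloud"):
    """Pick "onprem" or "cloud" for one job (hypothetical helper).

    predicted_wait: predicted queue wait on the on-premise cluster.
    wait_error: expected relative error of that prediction (0.2 = 20%).
    error_cutoff: assumed turning point above which the prediction is ignored.
    conservative: assumed fallback environment when the prediction is ignored.
    """
    # Past the cutoff the prediction is too unreliable to act on,
    # so fall back to the conservative choice.
    if wait_error > error_cutoff:
        return conservative
    # Otherwise compare predicted turnaround times directly.
    onprem_turnaround = predicted_wait + onprem_runtime
    return "onprem" if onprem_turnaround <= cloud_runtime else "cloud"
```

With a trusted prediction, `place_job(10, 60, 90, wait_error=0.1)` compares 70 against 90 and selects the on-premise cluster; with `wait_error=0.8` the same job goes to the fallback environment. Where the cutoff should sit depends on workload characteristics, which this sketch does not model.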

Preprint, 2016, arXiv

SLA-aware Interactive Workflow Assistant for HPC Parameter Sweeping Experiments

A common workflow in science and engineering is to (i) set up and deploy large experiments with tasks comprising an application and multiple parameter values; (ii) generate intermediate results; (iii) analyze them; and (iv) reprioritize the tasks. These steps are repeated until the desired goal is achieved, which can be the evaluation/simulation of complex systems or model calibration. Due to time and cost constraints, sweeping all possible parameter values of the user application is not always feasible. Experimental Design techniques can help users reorganize submission-execution-analysis workflows to reach a solution in a more timely manner. This paper introduces a novel tool that leverages users' feedback from analyzing intermediate results of parameter sweeping experiments to advise them on parameter selection strategies tied to their SLA constraints. We evaluated our tool with three applications from distinct domains and with distinct search space shapes. Our main finding is that users with submission-execution-analysis workflows can benefit from their interaction with intermediate results and adapt themselves according to their domain expertise and SLA constraints.

Preprint, 2013, arXiv

Patience-aware Scheduling for Cloud Services: Freeing Users from the Chains of Boredom

Scheduling of service requests in Cloud computing has traditionally focused on the reduction of pre-service wait, generally termed waiting time. Under certain conditions such as peak load, however, it is not always possible to give reasonable response times to all users. This work explores the fact that different users may have their own levels of tolerance or patience with response delays. We introduce scheduling strategies that produce better assignment plans by prioritising requests from users who expect to receive the results earlier and by postponing servicing jobs from those who are more tolerant of response delays. Our analytical results show that the behaviour of users' patience plays a key role in the evaluation of scheduling techniques, and our computational evaluation demonstrates that, under peak load, the new algorithms typically provide better user experience than the traditional FIFO strategy.
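The contrast between FIFO and a patience-based ordering can be seen in a toy example. The sketch below is not one of the paper's algorithms: it assumes a single server, a burst of requests that all arrive at time zero, and a hypothetical `patience` field giving the longest response time each user tolerates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    service_time: float  # time the server needs to process the request
    patience: float      # longest tolerated response time (hypothetical field)


def missed(order):
    """Serve requests back to back; return names whose response time exceeds patience."""
    clock = 0.0
    late = []
    for req in order:
        clock += req.service_time  # completion time of this request
        if clock > req.patience:
            late.append(req.name)
    return late


def fifo(requests):
    """Serve in arrival order."""
    return list(requests)


def patience_aware(requests):
    """Serve the least patient users first (assumed ordering rule)."""
    return sorted(requests, key=lambda r: r.patience)


# A peak-load burst, listed in arrival order.
burst = [
    Request("a", service_time=4.0, patience=20.0),
    Request("b", service_time=3.0, patience=5.0),
    Request("c", service_time=2.0, patience=4.0),
]

print(missed(fifo(burst)))            # ['b', 'c']
print(missed(patience_aware(burst)))  # []
```

Under FIFO the tolerant user "a" is served first and both impatient users wait too long; reordering by patience serves everyone within their tolerance here. Real workloads have staggered arrivals and uncertain patience, which this example ignores.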
