Source author record

Adam Barker

Adam Barker appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Artificial Intelligence Software Engineering Performance Computation and Language Computational Engineering, Finance, and Science Operating Systems Databases Information Retrieval Machine Learning physics.soc-ph Social and Information Networks

Catalog footprint

What is connected

31works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Benchmarking and Performance Modelling of MapReduce Communication Pattern

Understanding and predicting the performance of big data applications running in the cloud or on-premises could help minimise the overall cost of operations and provide opportunities in efforts to identify performance bottlenecks. The complexity of the low-level internals of big data frameworks and the ubiquity of application and workload configuration parameters makes it challenging and expensive to come up with comprehensive performance modelling solutions. In this paper, instead of focusing on a wide range of configurable parameters, we studied the low-level internals of the MapReduce communication pattern and used a minimal set of performance drivers to develop a set of phase level parametric models for approximating the execution time of a given application on a given cluster. Models can be used to infer the performance of unseen applications and approximate their performance when an arbitrary dataset is used as input. Our approach is validated by running empirical experiments in two setups. On average the error rate in both setups is plus or minus 10% from the measured values.

preprint2016arXiv

A Linked Data Scalability Challenge: Concept Reuse Leads to Semantic Decay

The increasing amount of available Linked Data resources is laying the foundations for more advanced Semantic Web applications. One of their main limitations, however, remains the general low level of data quality. In this paper we focus on a measure of quality which is negatively affected by the increase of the available resources. We propose a measure of semantic richness of Linked Data concepts and we demonstrate our hypothesis that the more a concept is reused, the less semantically rich it becomes. This is a significant scalability issue, as one of the core aspects of Linked Data is the propagation of semantic information on the Web by reusing common terms. We prove our hypothesis with respect to our measure of semantic richness and we validate our model empirically. Finally, we suggest possible future directions to address this scalability problem.

preprint2016arXiv

Cloud Benchmarking For Maximising Performance of Scientific Applications

How can applications be deployed on the cloud to achieve maximum performance? This question is challenging to address with the availability of a wide variety of cloud Virtual Machines (VMs) with different performance capabilities. The research reported in this paper addresses the above question by proposing a six step benchmarking methodology in which a user provides a set of weights that indicate how important memory, local communication, computation and storage related operations are to an application. The user can either provide a set of four abstract weights or eight fine grain weights based on the knowledge of the application. The weights along with benchmarking data collected from the cloud are used to generate a set of two rankings - one based only on the performance of the VMs and the other takes both performance and costs into account. The rankings are validated on three case study applications using two validation techniques. The case studies on a set of experimental VMs highlight that maximum performance can be achieved by the three top ranked VMs and maximum performance in a cost-effective manner is achieved by at least one of the top three ranked VMs produced by the methodology.

preprint2016arXiv

Container-Based Cloud Virtual Machine Benchmarking

With the availability of a wide range of cloud Virtual Machines (VMs) it is difficult to determine which VMs can maximise the performance of an application. Benchmarking is commonly used to this end for capturing the performance of VMs. Most cloud benchmarking techniques are typically heavyweight - time consuming processes which have to benchmark the entire VM in order to obtain accurate benchmark data. Such benchmarks cannot be used in real-time on the cloud and incur extra costs even before an application is deployed. In this paper, we present lightweight cloud benchmarking techniques that execute quickly and can be used in near real-time on the cloud. The exploration of lightweight benchmarking techniques are facilitated by the development of DocLite - Docker Container-based Lightweight Benchmarking. DocLite is built on the Docker container technology which allows a user-defined portion (such as memory size and the number of CPU cores) of the VM to be benchmarked. DocLite operates in two modes, in the first mode, containers are used to benchmark a small portion of the VM to generate performance ranks. In the second mode, historic benchmark data is used along with the first mode as a hybrid to generate VM ranks. The generated ranks are evaluated against three scientific high-performance computing applications. The proposed techniques are up to 91 times faster than a heavyweight technique which benchmarks the entire VM. It is observed that the first mode can generate ranks with over 90% and 86% accuracy for sequential and parallel execution of an application. The hybrid mode improves the correlation slightly but the first mode is sufficient for benchmarking cloud VMs.

preprint2016arXiv

DocLite: A Docker-Based Lightweight Cloud Benchmarking Tool

Existing benchmarking methods are time consuming processes as they typically benchmark the entire Virtual Machine (VM) in order to generate accurate performance data, making them less suitable for real-time analytics. The research in this paper is aimed to surmount the above challenge by presenting DocLite - Docker Container-based Lightweight benchmarking tool. DocLite explores lightweight cloud benchmarking methods for rapidly executing benchmarks in near real-time. DocLite is built on the Docker container technology, which allows a user-defined memory size and number of CPU cores of the VM to be benchmarked. The tool incorporates two benchmarking methods - the first referred to as the native method employs containers to benchmark a small portion of the VM and generate performance ranks, and the second uses historic benchmark data along with the native method as a hybrid to generate VM ranks. The proposed methods are evaluated on three use-cases and are observed to be up to 91 times faster than benchmarking the entire VM. In both methods, small containers provide the same quality of rankings as a large container. The native method generates ranks with over 90% and 86% accuracy for sequential and parallel execution of an application compared against benchmarking the whole VM. The hybrid method did not improve the quality of the rankings significantly.

preprint2016arXiv

Integrating Know-How into the Linked Data Cloud

This paper presents the first framework for integrating procedural knowledge, or "know-how", into the Linked Data Cloud. Know-how available on the Web, such as step-by-step instructions, is largely unstructured and isolated from other sources of online knowledge. To overcome these limitations, we propose extending to procedural knowledge the benefits that Linked Data has already brought to representing, retrieving and reusing declarative knowledge. We describe a framework for representing generic know-how as Linked Data and for automatically acquiring this representation from existing resources on the Web. This system also allows the automatic generation of links between different know-how resources, and between those resources and other online knowledge bases, such as DBpedia. We discuss the results of applying this framework to a real-world scenario and we show how it outperforms existing manual community-driven integration efforts.

preprint2015arXiv

Ad hoc Cloud Computing: From Concept to Realization

This paper presents the first complete, integrated and end-to-end solution for ad hoc cloud computing environments. Ad hoc clouds harvest resources from existing sporadically available, non-exclusive (i.e. primarily used for some other purpose) and unreliable infrastructures. In this paper we discuss the problems ad hoc cloud computing solves and outline our architecture which is based on BOINC.

preprint2015arXiv

Autonomous Fault Detection in Self-Healing Systems using Restricted Boltzmann Machines

Autonomously detecting and recovering from faults is one approach for reducing the operational complexity and costs associated with managing computing environments. We present a novel methodology for autonomously generating investigation leads that help identify systems faults, and extends our previous work in this area by leveraging Restricted Boltzmann Machines (RBMs) and contrastive divergence learning to analyse changes in historical feature data. This allows us to heuristically identify the root cause of a fault, and demonstrate an improvement to the state of the art by showing feature data can be predicted heuristically beyond a single instance to include entire sequences of information.

preprint2015arXiv

Budget Constrained Execution of Multiple Bag-of-Tasks Applications on the Cloud

Optimising the execution of Bag-of-Tasks (BoT) applications on the cloud is a hard problem due to the trade- offs between performance and monetary cost. The problem can be further complicated when multiple BoT applications need to be executed. In this paper, we propose and implement a heuristic algorithm that schedules tasks of multiple applications onto different cloud virtual machines in order to maximise performance while satisfying a given budget constraint. Current approaches are limited in task scheduling since they place a limit on the number of cloud resources that can be employed by the applications. However, in the proposed algorithm there are no such limits, and in comparison with other approaches, the algorithm on average achieves an improved performance of 10%. The experimental results also highlight that the algorithm yields consistent performance even with low budget constraints which cannot be achieved by competing approaches.

preprint2015arXiv

Cloud Services Brokerage: A Survey and Research Roadmap

A Cloud Services Brokerage (CSB) acts as an intermediary between cloud service providers (e.g., Amazon and Google) and cloud service end users, providing a number of value adding services. CSBs as a research topic are in there infancy. The goal of this paper is to provide a concise survey of existing CSB technologies in a variety of areas and highlight a roadmap, which details five future opportunities for research.

preprint2015arXiv

Executing Bag of Distributed Tasks on Virtually Unlimited Cloud Resources

Bag-of-Distributed-Tasks (BoDT) application is the collection of identical and independent tasks each of which requires a piece of input data located around the world. As a result, Cloud computing offers an ef- fective way to execute BoT application as it not only consists of multiple geographically distributed data centres but also allows a user to pay for what she actually uses only. In this paper, BoDT on the Cloud using virtually unlimited cloud resources. A heuristic algorithm is proposed to find an execution plan that takes budget constraints into account. Compared with other approaches, with the same given budget, our algorithm is able to reduce the overall execution time up to 50%.

preprint2015arXiv

Task Scheduling on the Cloud with Hard Constraints

Scheduling Bag-of-Tasks (BoT) applications on the cloud can be more challenging than grid and cluster environ- ments. This is because a user may have a budgetary constraint or a deadline for executing the BoT application in order to keep the overall execution costs low. The research in this paper is motivated to investigate task scheduling on the cloud, given two hard constraints based on a user-defined budget and a deadline. A heuristic algorithm is proposed and implemented to satisfy the hard constraints for executing the BoT application in a cost effective manner. The proposed algorithm is evaluated using four scenarios that are based on the trade-off between performance and the cost of using different cloud resource types. The experimental evaluation confirms the feasibility of the algorithm in satisfying the constraints. The key observation is that multiple resource types can be a better alternative to using a single type of resource.

preprint2014arXiv

A Semantic Web of Know-How: Linked Data for Community-Centric Tasks

This paper proposes a novel framework for representing community know-how on the Semantic Web. Procedural knowledge generated by web communities typically takes the form of natural language instructions or videos and is largely unstructured. The absence of semantic structure impedes the deployment of many useful applications, in particular the ability to discover and integrate know-how automatically. We discuss the characteristics of community know-how and argue that existing knowledge representation frameworks fail to represent it adequately. We present a novel framework for representing the semantic structure of community know-how and demonstrate the feasibility of our approach by providing a concrete implementation which includes a method for automatically acquiring procedural knowledge for real-world tasks.

preprint2014arXiv

Academic Cloud Computing Research: Five Pitfalls and Five Opportunities

This discussion paper argues that there are five fundamental pitfalls, which can restrict academics from conducting cloud computing research at the infrastructure level, which is currently where the vast majority of academic research lies. Instead academics should be conducting higher risk research, in order to gain understanding and open up entirely new areas. We call for a renewed mindset and argue that academic research should focus less upon physical infrastructure and embrace the abstractions provided by clouds through five opportunities: user driven research, new programming models, PaaS environments, and improved tools to support elasticity and large-scale debugging. The objective of this paper is to foster discussion, and to define a roadmap forward, which will allow academia to make longer-term impacts to the cloud computing community.

preprint2014arXiv

Are Clouds Ready to Accelerate Ad hoc Financial Simulations?

Applications employed in the financial services industry to capture and estimate a variety of risk metrics are underpinned by stochastic simulations which are data, memory and computationally intensive. Many of these simulations are routinely performed on production-based computing systems. Ad hoc simulations in addition to routine simulations are required to obtain up-to-date views of risk metrics. Such simulations are currently not performed as they cannot be accommodated on production clusters, which are typically over committed resources. Scalable, on-demand and pay-as-you go Virtual Machines (VMs) offered by the cloud are a potential platform to satisfy the data, memory and computational constraints of the simulation. However, "Are clouds ready to accelerate ad hoc financial simulations?" The research reported in this paper aims to experimentally verify this question by developing and deploying an important financial simulation, referred to as 'Aggregate Risk Analysis' on the cloud. Parallel techniques to improve efficiency and performance of the simulations are explored. Challenges such as accommodating large input data on limited memory VMs and rapidly processing data for real-time use are surmounted. The key result of this investigation is that Aggregate Risk Analysis can be accommodated on cloud VMs. Acceleration of up to 24x using multiple hardware accelerators over the implementation on a single accelerator, 6x over a multiple core implementation and approximately 60x over a baseline implementation was achieved on the cloud. However, computational time is wasted for every dollar spent on the cloud due to poor acceleration over multiple virtual cores. Interestingly, private VMs can offer better performance than public VMs on comparable underlying hardware.

preprint2014arXiv

BigExcel: A Web-Based Framework for Exploring Big Data in Social Sciences

This paper argues that there are three fundamental challenges that need to be overcome in order to foster the adoption of big data technologies in non-computer science related disciplines: addressing issues of accessibility of such technologies for non-computer scientists, supporting the ad hoc exploration of large data sets with minimal effort and the availability of lightweight web-based frameworks for quick and easy analytics. In this paper, we address the above three challenges through the development of 'BigExcel', a three tier web-based framework for exploring big data to facilitate the management of user interactions with large data sets, the construction of queries to explore the data set and the management of the infrastructure. The feasibility of BigExcel is demonstrated through two Yahoo Sandbox datasets. The first dataset is the Yahoo Buzz Score data set we use for quantitatively predicting trending technologies and the second is the Yahoo n-gram corpus we use for qualitatively inferring the coverage of important events. A demonstration of the BigExcel framework and source code is available at http://bigdata.cs.st-andrews.ac.uk/projects/bigexcel-exploring-big-data-for-social-sciences/.

preprint2014arXiv

Cloud Benchmarking for Performance

How can applications be deployed on the cloud to achieve maximum performance? This question has become significant and challenging with the availability of a wide variety of Virtual Machines (VMs) with different performance capabilities in the cloud. The above question is addressed by proposing a six step benchmarking methodology in which a user provides a set of four weights that indicate how important each of the following groups: memory, processor, computation and storage are to the application that needs to be executed on the cloud. The weights along with cloud benchmarking data are used to generate a ranking of VMs that can maximise performance of the application. The rankings are validated through an empirical analysis using two case study applications; the first is a financial risk application and the second is a molecular dynamics simulation, which are both representative of workloads that can benefit from execution on the cloud. Both case studies validate the feasibility of the methodology and highlight that maximum performance can be achieved on the cloud by selecting the top ranked VMs produced by the methodology.

preprint2014arXiv

Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-offs Between Performance and Cost

Bag of Distributed Tasks (BoDT) can benefit from decentralised execution on the Cloud. However, there is a trade-off between the performance that can be achieved by employing a large number of Cloud VMs for the tasks and the monetary constraints that are often placed by a user. The research reported in this paper is motivated towards investigating this trade-off so that an optimal plan for deploying BoDT applications on the cloud can be generated. A heuristic algorithm, which considers the user's preference of performance and cost is proposed and implemented. The feasibility of the algorithm is demonstrated by generating execution plans for a sample application. The key result is that the algorithm generates optimal execution plans for the application over 91\% of the time.

preprint2014arXiv

Location, Location, Location: Data-Intensive Distributed Computing in the Cloud

When orchestrating highly distributed and data-intensive Web service workflows the geographical placement of the orchestration engine can greatly affect the overall performance of a workflow. Orchestration engines are typically run from within an organisations' network, and may have to transfer data across long geographical distances, which in turn increases execution time and degrades the overall performance of a workflow. In this paper we present CloudForecast: a Web service framework and analysis tool which given a workflow specification, computes the optimal Amazon EC2 Cloud region to automatically deploy the orchestration engine and execute the workflow. We use geographical distance of the workflow, network latency and HTTP round-trip time between Amazon Cloud regions and the workflow nodes to find a ranking of Cloud regions. This combined set of simple metrics effectively predicts where the workflow orchestration engine should be deployed in order to reduce overall execution time. We evaluate our approach by executing randomly generated data-intensive workflows deployed on the PlanetLab platform in order to rank Amazon EC2 Cloud regions. Our experimental results show that our proposed optimisation strategy, depending on the particular workflow, can speed up execution time on average by 82.25% compared to local execution. We also show that the standard deviation of execution time is reduced by an average of almost 65% using the optimisation strategy.

preprint2014arXiv

Optimal Deployment of Geographically Distributed Workflow Engines on the Cloud

When orchestrating Web service workflows, the geographical placement of the orchestration engine(s) can greatly affect workflow performance. Data may have to be transferred across long geographical distances, which in turn increases execution time and degrades the overall performance of a workflow. In this paper, we present a framework that, given a DAG-based workflow specification, computes the op- timal Amazon EC2 cloud regions to deploy the orchestration engines and execute a workflow. The framework incorporates a constraint model that solves the workflow deployment problem, which is generated using an automated constraint modelling system. The feasibility of the framework is evaluated by executing different sample workflows representative of sci- entific workloads. The experimental results indicate that the framework reduces the workflow execution time and provides a speed up of 1.3x-2.5x over centralised approaches.

preprint2014arXiv

Uncovering the Perfect Place: Optimising Workflow Engine Deployment in the Cloud

When orchestrating highly distributed and data-intensive Web service workflows the geographical placement of the orchestration engine can greatly affect the overall performance of a workflow. We present CloudForecast: a Web service framework and analysis tool which, given a workflow specification, computes the optimal Amazon EC2 Cloud region to automatically deploy the orchestration engine and execute the workflow. We use geographical distance of the workflow, network latency and HTTP round-trip time between Amazon Cloud regions and the workflow nodes to find a ranking of Cloud regions. This overall ranking predicts where the workflow orchestration engine should be deployed in order to reduce overall execution time. Our experimental results show that our proposed optimisation strategy, depending on the particular workflow, can speed up execution time on average by 82.25% compared to local execution.

preprint2014arXiv

Workflow Partitioning and Deployment on the Cloud using Orchestra

Orchestrating service-oriented workflows is typically based on a design model that routes both data and control through a single point - the centralised workflow engine. This causes scalability problems that include the unnecessary consumption of the network bandwidth, high latency in transmitting data between the services, and performance bottlenecks. These problems are highly prominent when orchestrating workflows that are composed from services dispersed across distant geographical locations. This paper presents a novel workflow partitioning approach, which attempts to improve the scalability of orchestrating large-scale workflows. It permits the workflow computation to be moved towards the services providing the data in order to garner optimal performance results. This is achieved by decomposing the workflow into smaller sub workflows for parallel execution, and determining the most appropriate network locations to which these sub workflows are transmitted and subsequently executed. This paper demonstrates the efficiency of our approach using a set of experimental workflows that are orchestrated over Amazon EC2 and across several geographic network regions.

preprint2013arXiv

A Cloud Computing Survey: Developments and Future Trends in Infrastructure as a Service Computing

Cloud computing is a recent paradigm based around the notion of delivery of resources via a service model over the Internet. Despite being a new paradigm of computation, cloud computing owes its origins to a number of previous paradigms. The term cloud computing is well defined and no longer merits rigorous taxonomies to furnish a definition. Instead this survey paper considers the past, present and future of cloud computing. As an evolution of previous paradigms, we consider the predecessors to cloud computing and what significance they still hold to cloud services. Additionally we examine the technologies which comprise cloud computing and how the challenges and future developments of these technologies will influence the field. Finally we examine the challenges that limit the growth, application and development of cloud computing and suggest directions required to overcome these challenges in order to further the success of cloud computing.

preprint2013arXiv

A Dataflow Language for Decentralised Orchestration of Web Service Workflows

Orchestrating centralised service-oriented workflows presents significant scalability challenges that include: the consumption of network bandwidth, degradation of performance, and single points of failure. This paper presents a high-level dataflow specification language that attempts to address these scalability challenges. This language provides simple abstractions for orchestrating large-scale web service workflows, and separates between the workflow logic and its execution. It is based on a data-driven model that permits parallelism to improve the workflow performance. We provide a decentralised architecture that allows the computation logic to be moved "closer" to services involved in the workflow. This is achieved through partitioning the workflow specification into smaller fragments that may be sent to remote orchestration services for execution. The orchestration services rely on proxies that exploit connectivity to services in the workflow. These proxies perform service invocations and compositions on behalf of the orchestration services, and carry out data collection, retrieval, and mediation tasks. The evaluation of our architecture implementation concludes that our decentralised approach reduces the execution time of workflows, and scales accordingly with the increasing size of data sets.

preprint2013arXiv

An Architecture for Decentralised Orchestration of Web Service Workflows

Service-oriented workflows are typically executed using a centralised orchestration approach that presents significant scalability challenges. These challenges include the consumption of network bandwidth, degradation of performance, and single-points of failure. We provide a decentralised orchestration architecture that attempts to address these challenges. Our architecture adopts a design model that permits the computation to be moved "closer" to services in a workflow. This is achieved by partitioning workflows specified using our simple dataflow language into smaller fragments, which may be sent to remote locations for execution.

preprint2013arXiv

C2MS: Dynamic Monitoring and Management of Cloud Infrastructures

Server clustering is a common design principle employed by many organisations who require high availability, scalability and easier management of their infrastructure. Servers are typically clustered according to the service they provide whether it be the application(s) installed, the role of the server or server accessibility for example. In order to optimize performance, manage load and maintain availability, servers may migrate from one cluster group to another making it difficult for server monitoring tools to continuously monitor these dynamically changing groups. Server monitoring tools are usually statically configured and with any change of group membership requires manual reconfiguration; an unreasonable task to undertake on large-scale cloud infrastructures. In this paper we present the Cloudlet Control and Management System (C2MS); a system for monitoring and controlling dynamic groups of physical or virtual servers within cloud infrastructures. The C2MS extends Ganglia - an open source scalable system performance monitoring tool - by allowing system administrators to define, monitor and modify server groups without the need for server reconfiguration. In turn administrators can easily monitor group and individual server metrics on large-scale dynamic cloud infrastructures where roles of servers may change frequently. Furthermore, we complement group monitoring with a control element allowing administrator-specified actions to be performed over servers within service groups as well as introduce further customized monitoring metrics. This paper outlines the design, implementation and evaluation of the C2MS.

preprint2013arXiv

Monitoring Large-Scale Cloud Systems with Layered Gossip Protocols

Monitoring is an essential aspect of maintaining and developing computer systems that increases in difficulty proportional to the size of the system. The need for robust monitoring tools has become more evident with the advent of cloud computing. Infrastructure as a Service (IaaS) clouds allow end users to deploy vast numbers of virtual machines as part of dynamic and transient architectures. Current monitoring solutions, including many of those in the open-source domain rely on outdated concepts including manual deployment and configuration, centralised data collection and adapt poorly to membership churn. In this paper we propose the development of a cloud monitoring suite to provide scalable and robust lookup, data collection and analysis services for large-scale cloud systems. In lieu of centrally managed monitoring we propose a multi-tier architecture using a layered gossip protocol to aggregate monitoring information and facilitate lookup, information collection and the identification of redundant capacity. This allows for a resource aware data collection and storage architecture that operates over the system being monitored. This in turn enables monitoring to be done in-situ without the need for significant additional infrastructure to facilitate monitoring services. We evaluate this approach against alternative monitoring paradigms and demonstrate how our solution is well adapted to usage in a cloud-computing context.

preprint2013arXiv

RBioCloud: A Light-weight Framework for Bioconductor and R-based Jobs on the Cloud

Large-scale ad hoc analytics of genomic data is popular using the R-programming language supported by 671 software packages provided by Bioconductor. More recently, analytical jobs are benefitting from on-demand computing and storage, their scalability and their low maintenance cost, all of which are offered by the cloud. While Biologists and Bioinformaticists can take an analytical job and execute it on their personal workstations, it remains challenging to seamlessly execute the job on the cloud infrastructure without extensive knowledge of the cloud dashboard. How analytical jobs can not only with minimum effort be executed on the cloud, but also how both the resources and data required by the job can be managed is explored in this paper. An open-source light-weight framework for executing R-scripts using Bioconductor packages, referred to as `RBioCloud', is designed and developed. RBioCloud offers a set of simple command-line tools for managing the cloud resources, the data and the execution of the job. Three biological test cases validate the feasibility of RBioCloud. The framework is publicly available from http://www.rbiocloud.com.

preprint2013arXiv

The Royal Birth of 2013: Analysing and Visualising Public Sentiment in the UK Using Twitter

Analysis of information retrieved from microblogging services such as Twitter can provide valuable insight into public sentiment in a geographic region. This insight can be enriched by visualising information in its geographic context. Two underlying approaches for sentiment analysis are dictionary-based and machine learning. The former is popular for public sentiment analysis, and the latter has found limited use for aggregating public sentiment from Twitter data. The research presented in this paper aims to extend the machine learning approach for aggregating public sentiment. To this end, a framework for analysing and visualising public sentiment from a Twitter corpus is developed. A dictionary-based approach and a machine learning approach are implemented within the framework and compared using one UK case study, namely the royal birth of 2013. The case study validates the feasibility of the framework for analysis and rapid visualisation. One observation is that there is good correlation between the results produced by the popular dictionary-based approach and the machine learning approach when large volumes of tweets are analysed. However, for rapid analysis to be possible faster methods need to be developed using big data techniques and parallel methods.

preprint2013arXiv

Undefined By Data: A Survey of Big Data Definitions

The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

preprint2013arXiv

V-BOINC: The Virtualization of BOINC

The Berkeley Open Infrastructure for Network Computing (BOINC) is an open source client-server middleware system created to allow projects with large computational requirements, usually set in the scientific domain, to utilize a technically unlimited number of volunteer machines distributed over large physical distances. However various problems exist deploying applications over these heterogeneous machines using BOINC: applications must be ported to each machine architecture type, the project server must be trusted to supply authentic applications, applications that do not regularly checkpoint may lose execution progress upon volunteer machine termination and applications that have dependencies may find it difficult to run under BOINC. To solve such problems we introduce virtual BOINC, or V-BOINC, where virtual machines are used to run computations on volunteer machines. Application developers can then compile their applications on a single architecture, checkpointing issues are solved through virtualization API's and many security concerns are addressed via the virtual machine's sandbox environment. In this paper we focus on outlining a unique approach on how virtualization can be introduced into BOINC and demonstrate that V-BOINC offers acceptable computational performance when compared to regular BOINC. Finally we show that applications with dependencies can easily run under V-BOINC in turn increasing the computational potential volunteer computing offers to the general public and project developers.

Adam Barker

What is connected

Connect this record

See the researcher in context

Building this map preview

31 published item(s)

Benchmarking and Performance Modelling of MapReduce Communication Pattern

A Linked Data Scalability Challenge: Concept Reuse Leads to Semantic Decay

Cloud Benchmarking For Maximising Performance of Scientific Applications

Container-Based Cloud Virtual Machine Benchmarking

DocLite: A Docker-Based Lightweight Cloud Benchmarking Tool

Integrating Know-How into the Linked Data Cloud

Ad hoc Cloud Computing: From Concept to Realization

Autonomous Fault Detection in Self-Healing Systems using Restricted Boltzmann Machines

Budget Constrained Execution of Multiple Bag-of-Tasks Applications on the Cloud

Cloud Services Brokerage: A Survey and Research Roadmap

Executing Bag of Distributed Tasks on Virtually Unlimited Cloud Resources

Task Scheduling on the Cloud with Hard Constraints

A Semantic Web of Know-How: Linked Data for Community-Centric Tasks

Academic Cloud Computing Research: Five Pitfalls and Five Opportunities

Are Clouds Ready to Accelerate Ad hoc Financial Simulations?

BigExcel: A Web-Based Framework for Exploring Big Data in Social Sciences

Cloud Benchmarking for Performance

Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-offs Between Performance and Cost

Location, Location, Location: Data-Intensive Distributed Computing in the Cloud

Optimal Deployment of Geographically Distributed Workflow Engines on the Cloud

Uncovering the Perfect Place: Optimising Workflow Engine Deployment in the Cloud

Workflow Partitioning and Deployment on the Cloud using Orchestra

A Cloud Computing Survey: Developments and Future Trends in Infrastructure as a Service Computing

A Dataflow Language for Decentralised Orchestration of Web Service Workflows

An Architecture for Decentralised Orchestration of Web Service Workflows

C2MS: Dynamic Monitoring and Management of Cloud Infrastructures

Monitoring Large-Scale Cloud Systems with Layered Gossip Protocols

RBioCloud: A Light-weight Framework for Bioconductor and R-based Jobs on the Cloud

The Royal Birth of 2013: Analysing and Visualising Public Sentiment in the UK Using Twitter

Undefined By Data: A Survey of Big Data Definitions

V-BOINC: The Virtualization of BOINC