Topic overview

Distributed, Parallel, and Cluster Computing

4102 works12794 researchers

Open map Browse papers

Map preview

Start with the graph, then narrow the list

4102works

12794researchers

Next steps

Use the topic as a working map

Open the full map for clusters, then return here to scan ranked papers and people.

Inspect nearby papers, researchers, institutions and communities without opening a separate graph page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2016arXiv

Parallel Tensor Compression for Large-Scale Scientific Data

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8~TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulatio

preprint2013arXiv

Fast computation of computer-generated hologram using Xeon Phi coprocessor

We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, which have massively x86-based processors on one chip, recently released by Intel. CGHs can generate arbitrary light wavefronts, and therefore, are promising technology for many applications: for example, three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams. CGHs incur enormous computational cost. In this paper, we describe the implementations of several CGH generating algorithms on the Xeon Phi, and the comparisons in terms of the performance and the ease of programming between the Xeon Phi, a CPU and graphics processing unit (GPU).

preprint2017arXiv

Akid: A Library for Neural Network Research and Production from a Dataism Approach

Neural networks are a revolutionary but immature technique that is fast evolving and heavily relies on data. To benefit from the newest development and newly available data, we want the gap between research and production as small as possibly. On the other hand, differing from traditional machine learning models, neural network is not just yet another statistic model, but a model for the natural processing engine --- the brain. In this work, we describe a neural network library named {\texttt akid}. It provides higher level of abstraction for entities (abstracted as blocks) in nature upon the abstraction done on signals (abstracted as tensors) by Tensorflow, characterizing the dataism observation that all entities in nature processes input and emit out in some ways. It includes a full stack of software that provides abstraction to let researchers focus on research instead of implementation, while at the same time the developed program can also be put into production seamlessly in a distributed environment, and be production ready. At the top application stack, it provides out-of-box tools for neural network applications. Lower down, akid provides a programming paradigm that lets us

preprint2017arXiv

Multi-objective dynamic virtual machine consolidation in the cloud using ant colony system

In this paper, we present a novel multi-objective ant colony system algorithm for virtual machine (VM) consolidation in cloud data centers. The proposed algorithm builds VM migration plans, which are then used to minimize over-provisioning of physical machines (PMs) by consolidating VMs on under-utilized PMs. It optimizes two objectives that are ordered by their importance. The first and foremost objective in the proposed algorithm is to maximize the number of released PMs. Moreover, since VM migration is a resource-intensive operation, it also tries to minimize the number of VM migrations. The proposed algorithm is empirically evaluated in a series of experiments. The experimental results show that the proposed algorithm provides an efficient solution for VM consolidation in cloud data centers. Moreover, it outperforms two existing ant colony optimization based VM consolidation algorithms in terms of number of released PMs and number of VM migrations.

preprint2016arXiv

The Balance Attack Against Proof-Of-Work Blockchains: The R3 Testbed as an Example

In this paper, we identify a new form of attack, called the Balance attack, against proof-of-work blockchain systems. The novelty of this attack consists of delaying network communications between multiple subgroups of nodes with balanced mining power. Our theoretical analysis captures the precise tradeoff between the network delay and the mining power of the attacker needed to double spend in Ethereum with high probability. We quantify our probabilistic analysis with statistics taken from the R3 consortium, and show that a single machine needs 20 minutes to attack the consortium. Finally, we run an Ethereum private chain in a distributed system with similar settings as R3 to demonstrate the feasibility of the approach, and discuss the application of the Balance attack to Bitcoin. Our results clearly confirm that main proof-of-work blockchain protocols can be badly suited for consortium blockchains.

preprint2016arXiv

Correctness of Hierarchical MCS Locks with Timeout

This manuscript serves as a correctness proof of the Hierarchical MCS locks with Timeout (HMCS-T) described in our paper titled "An Efficient Abortable-locking Protocol for Multi-level NUMA Systems" appearing in the proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. HMCS-T is a very involved protocol. The system is stateful; the values of prior acquisition efforts affect the subsequent acquisition efforts. Also, the status of successors, predecessors, ancestors, and descendants affect steps followed by the protocol. The ability to make the protocol fully non-blocking leads to modifications to the \texttt{next} field, which causes deviation from the original MCS lock protocol both in acquisition and release. At several places, unconditional field updates are replaced with SWAP or CAS operations. We follow a multi-step approach to prove the correctness of HMCS-T. To demonstrate the correctness of the HMCS-T lock, we use the Spin model checking. Model checking causes a combinatorial explosion even to simulate a handful of threads. First, we understand the minimal, sufficient configurations necessary to prove safety properties of a

preprint2017arXiv

Distributed Graph Layout for Scalable Small-world Network Analysis

The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we present DGL, a fast, parallel, and memory-efficient distributed graph layout strategy that is specifically designed for small-world networks (low-diameter graphs with skewed vertex degree distributions). Label propagation-based partitioning and a scalable BFS-based ordering are the main steps in the layout strategy. We show that the DGL layout can significantly improve end-to-end performance of five challenging graph analytics workloads: PageRank, a parallel subgraph enumeration program, tuned implementations of breadth-first search and single-source shortest paths, and RDF3X-MPI, a distributed SPARQL query processing engine. Using these benchmarks, we additionally offer a comprehensive analysis on how graph layout affects the perform

preprint2017arXiv

Security-related Research in Ubiquitous Computing -- Results of a Systematic Literature Review

In an endeavor to reach the vision of ubiquitous computing where users are able to use pervasive services without spatial and temporal constraints, we are witnessing a fast growing number of mobile and sensor-enhanced devices becoming available. However, in order to take full advantage of the numerous benefits offered by novel mobile devices and services, we must address the related security issues. In this paper, we present results of a systematic literature review (SLR) on security-related topics in ubiquitous computing environments. In our study, we found 5165 scientific contributions published between 2003 and 2015. We applied a systematic procedure to identify the threats, vulnerabilities, attacks, as well as corresponding defense mechanisms that are discussed in those publications. While this paper mainly discusses the results of our study, the corresponding SLR protocol which provides all details of the SLR is also publicly available for download.

preprint2013arXiv

Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as `Kepler'. We provide a review of previous optimisation strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of `performance enhancing' features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimised storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed res

preprint2016arXiv

Request-Based Gossiping without Deadlocks

By the distributed averaging problem is meant the problem of computing the average value of a set of numbers possessed by the agents in a distributed network using only communication between neighboring agents. Gossiping is a well-known approach to the problem which seeks to iteratively arrive at a solution by allowing each agent to interchange information with at most one neighbor at each iterative step. Crafting a gossiping protocol which accomplishes this is challenging because gossiping is an inherently collaborative process which can lead to deadlocks unless careful precautions are taken to ensure that it does not. Many gossiping protocols are request-based which means simply that a gossip between two agents will occur whenever one of the two agents accepts a request to gossip placed by the other. In this paper, we present three deterministic request-based protocols. We show by example that the first can deadlock. The second is guaranteed to avoid deadlocks and requires fewer transmissions per iteration than standard broadcast-based distributed averaging protocols by exploiting the idea of local ordering together with the notion of an agent's neighbor queue; the protocol r

preprint2016arXiv

Parallel Digital Predistortion Design on Mobile GPU and Embedded Multicore CPU for Mobile Transmitters

Digital predistortion (DPD) is a widely adopted baseband processing technique in current radio transmitters. While DPD can effectively suppress unwanted spurious spectrum emissions stemming from imperfections of analog RF and baseband electronics, it also introduces extra processing complexity and poses challenges on efficient and flexible implementations, especially for mobile cellular transmitters, considering their limited computing power compared to basestations. In this paper, we present high data rate implementations of broadband DPD on modern embedded processors, such as mobile GPU and multicore CPU, by taking advantage of emerging parallel computing techniques for exploiting their computing resources. We further verify the suppression effect of DPD experimentally on real radio hardware platforms. Performance evaluation results of our DPD design demonstrate the high efficacy of modern general purpose mobile processors on accelerating DPD processing for a mobile transmitter.

preprint2016arXiv

A Peta-Scale Data Movement and Analysis in Data Warehouse (APSDMADW)

In this research paper so as to handle Information warehousing as well as online synthetic dispensation OLAP are necessary aspects of conclusion support which takes more and more turn into a focal point of the data source business.This paper offers an outline of information warehousing also OLAP systems with a highlighting on their latest necessities.All of us explain backside end tackle for extract clean-up and load information into an Data warehouse multi dimensional data model usual of OLAP frontend user tools for query and facts evaluation server extension for useful query dispensation and apparatus for metadata managing and for supervision the stockroom. Insights centered on complete data on customer actions manufactured goods act and souk performance are powerful advance and opposition in the internet gap .In this research conclude the company inspiration and the program and efficiency of servers working in a data warehouse through use of some new techniques and get better and efficient results. Data in petabyte scale. This test shows the data dropping rate in data warehouse. The locomotive is in creation at Yahoo! since 2007 and presently manages more than half a dozen peta

preprint2016arXiv

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements. We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack. The framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various ty

preprint2016arXiv

ASAP: Asynchronous Approximate Data-Parallel Computation

Emerging workloads, such as graph processing and machine learning are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines using bulk-synchronous processing (BSP) or other synchronous processing paradigms such as map-reduce. However, data parallel processing primitives such as repeated barrier and reduce operations introduce high synchronization overheads. Hence, many existing data-processing platforms use asynchrony and staleness to improve data-parallel job performance. Often, these systems simply change the synchronous communication to asynchronous between the worker nodes in the cluster. This improves the throughput of data processing but results in poor accuracy of the final output since different workers may progress at different speeds and process inconsistent intermediate outputs. In this paper, we present ASAP, a model that provides asynchronous and approximate processing semantics for data-parallel computation. ASAP provides fine-grained worker synchronization using NOTIFY-ACK semantics that allows independent workers to run asynchronously. ASAP also provides

preprint2016arXiv

Remarks to the paper "Control system for reducing energy consumption in backbone computer network" from "Concurrency and Computation: Practice and Experience" journal

This paper indicates two errors in the formulation of the main optimization model in the article "Control system for reducing energy consumption in backbone computer network" by Niewiadomska-Szynkiewicz et al. and shows how to fix them.

preprint2016arXiv

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Discrete Event Simulation is a widely used technique that is used to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributes Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, cluster of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against byzantine failures since synchronization mes

preprint2016arXiv

Distributed Real-Time Sentiment Analysis for Big Data Social Streams

Big data trend has enforced the data-centric systems to have continuous fast data streams. In recent years, real-time analytics on stream data has formed into a new research field, which aims to answer queries about what-is-happening-now with a negligible delay. The real challenge with real-time stream data processing is that it is impossible to store instances of data, and therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing of data should be performed in a way that only a short summary of stream is stored in main memory. In addition, due to high speed of arrival, average processing time for each instance of data should be in such a way that incoming instances are not lost without being captured. Lastly, the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system written in Java that aims to solve this challenge by enforcing both the processing and learning process to be done in distributed form. Sentinel is built on top of Apache Storm, a distributed computing platform. Sentinels learner, Vertical Hoeffding Tree, is a parallel decision tree-learning algorithm based on the VFDT, with ability of ena

preprint2016arXiv

Distributed Convex Optimization with Inequality Constraints over Time-varying Unbalanced Digraphs

This paper considers a distributed convex optimization problem with inequality constraints over time-varying unbalanced digraphs, where the cost function is a sum of local objectives, and each node of the graph only knows its local objective and inequality constraints. Although there is a vast literature on distributed optimization, most of them require the graph to be balanced, which is quite restrictive and not necessary. Very recently, the unbalanced problem has been resolved only for either time-invariant graphs or unconstrained optimization. This work addresses the unbalancedness by focusing on an epigraph form of the constrained optimization. A striking feature is that this novel idea can be easily used to study time-varying unbalanced digraphs. Under local communications, a simple iterative algorithm is then designed for each node. We prove that if the graph is uniformly jointly strongly connected, each node asymptotically converges to some common optimal solution.

preprint2016arXiv

Reduce The Wastage of Data During Movement in Data Warehouse

In this research paper so as to handle Data in warehousing as well as reduce the wastage of data and provide a better results which takes more and more turn into a focal point of the data source business. Data warehousing and on-line analytical processing (OLAP) are vital fundamentals of resolution hold, which has more and more become a focal point of the database manufacturing. Lots of marketable yield and services be at the present accessible, and the entire primary database management organization vendor nowadays have contributions in the area assessment hold up spaces some quite dissimilar necessities on record technology compare to conventional on-line transaction giving out application. This article gives a general idea of data warehousing and OLAP technologies, with the highlighting on top of their latest necessities. So tools which is used for extract, clean-up and load information into back end of a information warehouse; multidimensional data model usual of OLAP; front end client tools for querying and data analysis; server extension for proficient query processing; and tools for data managing and for administration the warehouse. In adding to survey the circumstances of

preprint2016arXiv

Timestamps for Partial Replication

Maintaining causal consistency in distributed shared memory systems using vector timestamps has received a lot of attention from both theoretical and practical prospective. However, most of the previous literature focuses on full replication where each data is stored in all replicas, which may not be scalable due to the increasing amount of data. In this report, we investigate how to achieve causal consistency in partial replicated systems, where each replica may store different set of data. We propose an algorithm that tracks causal dependencies via vector timestamp in client-server model for partial replication. The cost of our algorithm in terms of timestamps size varies as a function of the manner in which the replicas share data, and the set of replicas accessed by each client. We also establish a connection between our algorithm with the previous work on full replication.

preprint2016arXiv

Emerging Security Challenges of Cloud Virtual Infrastructure

The cloud computing model is rapidly transforming the IT landscape. Cloud computing is a new computing paradigm that delivers computing resources as a set of reliable and scalable internet-based services allowing customers to remotely run and manage these services. Infrastructure-as-a-service (IaaS) is one of the popular cloud computing services. IaaS allows customers to increase their computing resources on the fly without investing in new hardware. IaaS adapts virtualization to enable on-demand access to a pool of virtual computing resources. Although there are great benefits to be gained from cloud computing, cloud computing also enables new categories of threats to be introduced. These threats are a result of the cloud virtual infrastructure complexity created by the adoption of the virtualization technology. Breaching the security of any component in the cloud virtual infrastructure significantly impacts on the security of other components and consequently affects the overall system security. This paper explores the security problem of the cloud platform virtual infrastructure identifying the existing security threats and the complexities of this virtual infrastructure. The pa

preprint2016arXiv

Connectivity based technique for localization of nodes in wireless sensor networks

We propose a localization algorithm for wireless sensor networks, which is simple in design, does not involve significant overhead and yet provides acceptable position estimates of sensor nodes. The algorithm uses settled nodes as beacon nodes so as to increase the number of beacon nodes. The algorithm is range free and does not need any additional piece of hardware for ranging. It also does not involve any significant communication overhead for localization. The simulation and results show that good localization accuracy is achieved for outdoor environments.

preprint2016arXiv

Application-aware Retiming of Accelerators: A High-level Data-driven Approach

Flexibility at hardware level is the main driving force behind adaptive systems whose aim is to realise microarhitecture deconfiguration 'online'. This feature allows the software/hardware stack to tolerate drastic changes of the workload in data centres. With emerge of FPGA reconfigurablity this technology is becoming a mainstream computing paradigm. Adaptivity is usually accompanied by the high-level tools to facilitate multi-dimensional space exploration. An essential aspect in this space is memory orchestration where on-chip and off-chip memory distribution significantly influences the architecture in coping with the critical spatial and timing constraints, e.g. Place and Route. This paper proposes a memory smart technique for a particular class of adaptive systems: Elastic Circuits which enjoy slack elasticity at fine level of granularity. We explore retiming of a set of popular benchmarks via investigating the memory distribution within and among accelerators. The area, performance and power patterns are adopted by our high-level synthesis framework, with respect to the behaviour of the input descriptions, to improve the quality of the synthesised elastic circuits.

preprint2016arXiv

Parallelizing Word2Vec in Multi-Core and Many-Core Architectures

Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "HogBatch" by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.

535 works