Source author record

Wim Vanderbauwhede

Wim Vanderbauwhede appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Performance; Programming Languages; Hardware Architecture

Catalog footprint

What is connected

8 works

4 topics

4 close collaborators


preprint 2026 arXiv

Proceedings of the 1st International Workshop on Low Carbon Computing (LOCO 2024)

This is the proceedings of the 1st International Workshop on Low Carbon Computing (LOCO 2024).

preprint 2015 arXiv

A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs

High Performance Computing (HPC) platforms allow scientists to model computationally intensive algorithms. HPC clusters increasingly use General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs provide an attractive alternative to GPGPUs for use as co-processors, but they are still far from being mainstream due to a number of challenges faced when using FPGA-based platforms. Our research aims to make FPGA-based high performance computing more accessible to the scientific community. In this work we present the results of investigating the acceleration of a particular atmospheric model, Flexpart, on FPGAs. We focus on accelerating the most computationally intensive kernel from this model. The key contribution of our work is the architectural exploration we undertook to arrive at a solution that best exploits the parallelism available in the legacy code, and is also convenient to program, so that eventually the compilation of high-level legacy code to our architecture can be fully automated. We present the three different types of architecture, comparing their resource utilization and performance, and propose that an architecture where there are a number of computational cores, each built along the lines of a vector instruction processor, works best in this particular scenario, and is a promising candidate for a generic FPGA-based platform for scientific computation. We also present the results of experiments done with various configuration parameters of the proposed architecture, to show its utility in adapting to a range of scientific applications.

preprint 2015 arXiv

An Intermediate Language and Estimator for Automated Design Space Exploration on FPGAs

We present the TyTra-IR, a new intermediate language intended as a compilation target for high-level language compilers and a front-end for HDL code generators. We develop the requirements of this new language based on the design-space of FPGAs that it should be able to express and the estimation-space in which each configuration from the design-space should be mappable in an automated design flow. We use a simple kernel to illustrate multiple configurations using the semantics of TyTra-IR. The key novelty of this work is the cost model for resource-costs and throughput for different configurations of interest for a particular kernel. Through the realistic example of a Successive Over-Relaxation kernel implemented both in TyTra-IR and HDL, we demonstrate both the expressiveness of the IR and the accuracy of our cost model.
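The abstract does not reproduce the kernel itself. As a reminder of what a Successive Over-Relaxation update computes in general, here is a minimal Python sketch for a small dense linear system. It is not the TyTra-IR or HDL implementation from the paper; the function name `sor_solve`, the relaxation factor and the test system are assumptions chosen for illustration.

```python
def sor_solve(A, b, omega=1.25, tol=1e-10, max_iter=10_000):
    """Solve A x = b with Successive Over-Relaxation (illustrative sketch).

    A is a list of rows, b a list of floats. Convergence is assumed,
    e.g. for a symmetric positive definite A with 0 < omega < 2.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_delta = 0.0
        for i in range(n):
            # Gauss-Seidel estimate for x[i], using already-updated entries.
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel = (b[i] - sigma) / A[i][i]
            # Over-relaxation: blend the old value with the new estimate.
            new_value = (1.0 - omega) * x[i] + omega * gauss_seidel
            max_delta = max(max_delta, abs(new_value - x[i]))
            x[i] = new_value
        if max_delta < tol:
            break
    return x


# Hypothetical 2x2 test system; the exact solution is (1/11, 7/11).
solution = sor_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The inner update is the part that a stencil-style hardware kernel would pipeline; the convergence loop around it is ordinary host-side control.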

preprint 2015 arXiv

Inferring Program Transformations from Type Transformations for Partitioning of Ordered Sets

In this paper I introduce a mechanism to derive program transformations from order-preserving transformations of vector types. The purpose of this work is to allow automatic generation of correct-by-construction instances of programs in a streaming data processing paradigm suitable for FPGA processing. We show that it is possible to automatically derive instances for programs based on combinations of opaque element-processing functions combined using foldl and map, purely from the type transformations.
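As an illustration of the idea only, the following Python sketch assumes a vector is a list and that the type transformation is an order-preserving reshape of a flat vector into equal-sized chunks. The names `split`, `merge`, `map_partitioned` and `foldl_partitioned` are hypothetical and are not taken from the paper. The sketch shows that a map or a left fold over the flat vector can be rewritten over the partitioned vector without changing the result.

```python
from functools import reduce


def split(xs, k):
    """Order-preserving reshape: flat list -> list of chunks of size k."""
    if k <= 0 or len(xs) % k != 0:
        raise ValueError("chunk size must be positive and divide the length")
    return [xs[i:i + k] for i in range(0, len(xs), k)]


def merge(chunks):
    """Inverse of split: concatenate the chunks in order."""
    return [x for chunk in chunks for x in chunk]


def map_partitioned(f, chunks):
    """map over a partitioned vector: apply f inside every chunk."""
    return [[f(x) for x in chunk] for chunk in chunks]


def foldl_partitioned(g, acc, chunks):
    """foldl over a partitioned vector: thread the accumulator through
    the chunks in order, folding each chunk from the left."""
    return reduce(lambda a, chunk: reduce(g, chunk, a), chunks, acc)


xs = list(range(8))
f = lambda x: 2 * x + 1          # opaque element-processing function
g = lambda acc, x: acc * 10 + x  # not commutative, so order matters

flat_map = [f(x) for x in xs]
flat_fold = reduce(g, xs, 0)

chunks = split(xs, 4)
partitioned_map = merge(map_partitioned(f, chunks))
partitioned_fold = foldl_partitioned(g, 0, chunks)
```

Because `g` is deliberately order-sensitive, the fold results only agree when the partitioning preserves element order, which is the property the abstract relies on.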

preprint 2015 arXiv

Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems

In this report we present a novel approach to model coupling for shared-memory multicore systems hosting OpenCL-compliant accelerators, which we call the Glasgow Model Coupling Framework (GMCF). We discuss the implementation of a prototype of GMCF and its application to coupling the Weather Research and Forecasting Model (WRF) and an OpenCL-accelerated version of the Large Eddy Simulator for Urban Flows (LES) developed at DPRI. The first stage of this work concerned the OpenCL port of the LES. The methodology used for the OpenCL port is a combination of automated analysis and code generation and rule-based manual parallelization. For the evaluation, the non-OpenCL LES code was compiled using gfortran, ifort and pgfortran, in each case with auto-parallelization and auto-vectorization. The OpenCL-accelerated version of the LES achieves a 7 times speed-up on an NVIDIA GeForce GTX 480 GPGPU, compared to the fastest possible compilation of the original code running on a 12-core Intel Xeon E5-2640. In the second stage of this work, we built the Glasgow Model Coupling Framework and successfully used it to couple an OpenMP-parallelized WRF instance with an OpenCL LES instance which runs the LES code on the GPGPU. The system requires only minimal changes to the original code. The report discusses the rationale, aims, approach and implementation details of this work.

preprint 2014 arXiv

A Parallel Task-based Approach to Linear Algebra

Processors with large numbers of cores are becoming commonplace. In order to take advantage of the available resources in these systems, the programming paradigm has to move towards increased parallelism. However, increasing the level of concurrency in the program does not necessarily lead to better performance. Parallel programming models have to provide flexible ways of defining parallel tasks and, at the same time, manage the created tasks efficiently. OpenMP is a widely accepted programming model for shared-memory architectures. In this paper we highlight some of the drawbacks in the OpenMP tasking approach, and propose an alternative model based on the Glasgow Parallel Reduction Machine (GPRM) programming framework. As the main focus of this study, we deploy our model to solve a fundamental linear algebra problem, LU factorisation of sparse matrices. We have used the SparseLU benchmark from the BOTS benchmark suite, and compared the results obtained from our model to those of the OpenMP tasking approach. The TILEPro64 system has been used to run the experiments. The results are very promising, not only because of the performance improvement for this particular problem, but also because they verify the task management efficiency, stability, and flexibility of our model, which can be applied to solve problems in future many-core systems.

preprint 2014 arXiv

An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desired number of threads, based on its parallel nature and the system resources it is allowed. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of execution but also the mapping of threads to the execution resources is of great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmarks on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies such as those of the SMP Linux scheduler become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our novel approach could be implemented as a pragma similar to those in the new task-based OpenMP versions, or can be incorporated as a distributed thread mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.
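The abstract gives only the principle of the heuristic (map threads according to how much useful work each CPU has been doing). A minimal Python sketch of one way such a rule could look is given below. It assumes a per-core busy-time figure is available and places each new thread on the least busy core. The names `pick_core` and `map_threads` are hypothetical, and the estimated thread costs stand in for busy times that a real system would measure; this is not the paper's implementation.

```python
def pick_core(busy_time):
    """Return the index of the core with the least accumulated busy time.

    Ties are broken in favour of the lowest core index.
    """
    if not busy_time:
        raise ValueError("at least one core is required")
    return min(range(len(busy_time)), key=busy_time.__getitem__)


def map_threads(thread_costs, n_cores):
    """Greedily map threads to cores, one at a time.

    thread_costs are assumed estimates of the useful work per thread;
    in a real system the busy times would come from per-CPU counters.
    """
    busy = [0.0] * n_cores
    placement = []
    for cost in thread_costs:
        core = pick_core(busy)
        placement.append(core)
        busy[core] += cost
    return placement, busy


# Five threads with assumed costs, mapped onto three cores.
placement, busy = map_threads([5.0, 3.0, 2.0, 4.0, 1.0], 3)
```

With these assumed costs the fourth thread lands on core 2 and the fifth on core 1, since those are the least loaded cores at the moment each thread arrives.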

preprint 2014 arXiv

Cache-aware Parallel Programming for Manycore Processors

With rapidly evolving technology, multicore and manycore processors have emerged as promising architectures to benefit from increasing transistor numbers. The transition towards these parallel architectures makes today an exciting time to investigate challenges in parallel computing. The TILEPro64 is a manycore accelerator, composed of 64 tiles interconnected via multiple 8x8 mesh networks. It contains per-tile caches and supports cache-coherent shared memory by default. In this paper we present a programming technique to take advantage of distributed caching facilities in manycore processors. Unlike other work in this area, our approach does not use architecture-specific libraries. Instead, we provide the programmer with a novel technique for programming future Non-Uniform Cache Architecture (NUCA) manycore systems, bearing in mind their caching organisation. We show that our localised programming approach can result in a significant improvement of the parallelisation efficiency (speed-up).
