Source author record

Paul H. J. Kelly

Paul H. J. Kelly appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Distributed, Parallel, and Cluster Computing; Computer Vision; Mathematical Software; Robotics; Performance; Artificial Intelligence; eess.SP; Hardware Architecture; Machine Learning

Catalog footprint

What is connected

12 works

9 topics

4 close collaborators

preprint, 2022, arXiv

Systematic Comparison of Path Planning Algorithms using PathBench

Path planning is an essential component of mobile robotics. Classical path planning algorithms, such as wavefront and rapidly-exploring random tree (RRT), are used heavily in autonomous robots. With the recent advances in machine learning, development of learning-based path planning algorithms has been experiencing rapid growth. A unified path planning interface that facilitates the development and benchmarking of existing and new algorithms is needed. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking existing and future, classical and learning-based path planning algorithms in 2D and 3D grid world environments. Many existing path planning algorithms are supported, e.g. A*, Dijkstra, waypoint planning networks, value iteration networks, and gated path planning networks, and integrating new algorithms is easy and clearly specified. The benchmarking ability of PathBench is explored in this paper by comparing algorithms across five different hardware systems and three different map types, including built-in PathBench maps, video game maps, and maps from real world databases. Metrics such as path length, success rate, and computational time were used to evaluate algorithms. Algorithmic analysis was also performed on a real world robot to demonstrate PathBench's support for Robot Operating System (ROS). PathBench is open source.
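
The kind of comparison described in this abstract can be illustrated with a minimal sketch. It is not PathBench's interface: the function names, the two toy maps, and the task list are hypothetical, and the planner is a plain breadth-first wavefront on a 4-connected grid. It reports two of the metrics named above (success rate and path length); computational time is left out because it is not reproducible across machines.

```python
from collections import deque


def wavefront_plan(grid, start, goal):
    """Breadth-first (wavefront) planner on a 4-connected 2D grid.

    grid[r][c] == 1 marks an obstacle. Returns the path as a list of
    (row, col) cells from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                frontier.append(nxt)
    return None


def evaluate(planner, tasks):
    """Return (success_rate, mean_path_length) over (grid, start, goal) tasks."""
    lengths = []
    for grid, start, goal in tasks:
        path = planner(grid, start, goal)
        if path is not None:
            lengths.append(len(path) - 1)  # path length counted in moves
    success_rate = len(lengths) / len(tasks)
    mean_length = sum(lengths) / len(lengths) if lengths else float("nan")
    return success_rate, mean_length


# Hypothetical toy maps: one solvable, one split in two by a wall.
OPEN_MAP = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
WALLED_MAP = [[0, 1, 0],
              [0, 1, 0],
              [0, 1, 0]]
TASKS = [(OPEN_MAP, (0, 0), (2, 2)), (WALLED_MAP, (0, 0), (0, 2))]
```

Calling `evaluate(wavefront_plan, TASKS)` scores the planner on both maps; any other planner with the same `(grid, start, goal)` signature could be passed in its place for a side-by-side comparison.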

preprint, 2021, arXiv

Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors

Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation, making them suitable for edge computation. Unfortunately, these devices' limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain, a compiler that targets SCAMP-5 (a general-purpose FPSP) and generates code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long as that produced by the other available compilers for SCAMP-5.

preprint, 2021, arXiv

Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid ("off-the-grid"). Our work is motivated by modeling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that temporal blocking with this scheme improves performance by up to 1.6x over highly optimized, vectorized, spatially blocked code.
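
As a rough illustration of temporal blocking itself, the sketch below advances a 1D three-point stencil several time steps per tile before moving on to the next tile. It is an assumption-laden toy, not the scheme from the paper and not Devito code: the stencil weights, the function names, and the overlapped-halo strategy are chosen for brevity, and the sparse off-the-grid sources that make the real problem hard are deliberately absent.

```python
def step(u):
    """One time step of a 3-point stencil; the two end cells are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return new


def run_plain(u, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def run_time_blocked(u, steps, tile, block_t):
    """Overlapped temporal blocking: advance each tile block_t steps at a time.

    Each tile is copied together with a halo of up to block_t cells per side.
    Values at a cut edge of the local copy go stale by one more cell per step,
    so after block_t steps only the halo is contaminated and the tile interior
    matches the plain sweep.
    """
    n = len(u)
    done = 0
    while done < steps:
        t = min(block_t, steps - done)
        out = u[:]
        for s in range(0, n, tile):
            e = min(s + tile, n)
            lo, hi = max(0, s - t), min(n, e + t)
            local = u[lo:hi]              # tile plus halo, small enough to stay hot
            for _ in range(t):
                local = step(local)       # several time steps on the same data
            out[s:e] = local[s - lo:e - lo]  # keep only the valid interior
        u = out
        done += t
    return u
```

With `u0 = [0.0] * 20` and `u0[7] = 1.0`, `run_time_blocked(u0, 7, tile=5, block_t=3)` produces the same values as `run_plain(u0, 7)`. The price of touching each tile once per `block_t` steps is redundant work in the halos, which is one reason tile and block sizes are typically tuned.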

preprint, 2020, arXiv

A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.

preprint, 2020, arXiv

AnalogNet: Convolutional Neural Network Inference on Analog Focal Plane Sensor Processors

We present a high-speed, energy-efficient Convolutional Neural Network (CNN) architecture utilising the capabilities of a unique class of devices known as analog Focal Plane Sensor Processors (FPSP), in which the sensor and the processor are embedded together on the same silicon chip. Unlike traditional vision systems, where the sensor array sends collected data to a separate processor for processing, FPSPs allow data to be processed on the imaging device itself. This unique architecture enables ultra-fast image processing and high energy efficiency, at the expense of limited processing resources and approximate computations. In this work, we show how to convert standard CNNs to FPSP code, and demonstrate a method of training networks to increase their robustness to analog computation errors. Our proposed architecture, coined AnalogNet, reaches a testing accuracy of 96.9% on the MNIST handwritten digits recognition task, at a speed of 2260 FPS, for a cost of 0.7 mJ per frame.

preprint, 2020, arXiv

Architecture and performance of Devito, a system for automated stencil computation

Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process, from mathematical equations down to C++ code, is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expression elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.

preprint, 2016, arXiv

A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake

We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit.
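
The idea of exploiting structure in the extruded direction can be sketched as follows. This is an illustrative toy rather than Firedrake's numbering: it assumes values live only on vertices, that each vertex column is stored contiguously, and the function names are hypothetical. The unstructured base mesh is consulted once per column, after which every layer is reached by adding a fixed offset.

```python
def column_offsets(base_cell_vertices, n_layers):
    """For each base cell, the bottom-level indices of its vertex values.

    Values are numbered column by column: vertex v at level k has index
    v * (n_layers + 1) + k, so moving up one level adds 1 to every index.
    """
    levels = n_layers + 1
    return [[v * levels for v in verts] for verts in base_cell_vertices]


def sum_over_cells(data, base_cell_vertices, n_layers):
    """Sum the vertex values of every prismatic cell.

    The indirect lookup through the base mesh happens once per column;
    the loop over layers uses direct arithmetic on the bottom indices.
    """
    totals = {}
    for cell, bottom in enumerate(column_offsets(base_cell_vertices, n_layers)):
        for layer in range(n_layers):
            # A prism touches its base vertices at levels `layer` and `layer + 1`.
            idx = [b + layer for b in bottom] + [b + layer + 1 for b in bottom]
            totals[(cell, layer)] = sum(data[i] for i in idx)
    return totals
```

For two triangles sharing an edge, `sum_over_cells(list(range(12)), [[0, 1, 2], [1, 2, 3]], 2)` visits four prisms while reading the base connectivity only twice. The more layers a column has, the smaller the share of indirect accesses, which is the amortization effect the abstract describes.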

preprint, 2016, arXiv

Comparative Design Space Exploration of Dense and Semi-Dense SLAM

SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines, which is essential for systems research and integration, is non-existent. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.

preprint, 2015, arXiv

An Interrupt-Driven Work-Sharing For-Loop Scheduler

In this paper we present a parallel for-loop scheduler which is based on work-stealing principles but runs under a completely cooperative scheme. POSIX signals are used by idle threads to interrupt left-behind workers, which in turn decide what portion of their workload can be given to the requester. We call this scheme Interrupt-Driven Work-Sharing (IDWS). This article describes how IDWS works, how it can be integrated into any POSIX-compliant OpenMP implementation and how a user can manually replace OpenMP parallel for-loops with IDWS in existing POSIX-compliant C++ applications. Additionally, we measure its performance using both a synthetic benchmark with varying distributions of workload across the iteration space and a real-life application on Sandy Bridge and Xeon Phi systems. Regardless of the workload distribution and the underlying hardware, IDWS is always the best or among the best-performing strategies, providing a good all-around solution to the scheduling-choice dilemma.

preprint, 2015, arXiv

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.

preprint, 2015, arXiv

Thread Parallelism for Highly Irregular Computation in Anisotropic Mesh Adaptation

Thread-level parallelism in irregular applications with mutable data dependencies presents challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free. In this article we describe a methodology for exploiting thread parallelism for a class of graph-mutating worklist algorithms, which guarantees safe parallel execution via processing in rounds of independent sets and using a deferred update strategy to commit changes in the underlying data structures. Scalability is assisted by atomic fetch-and-add operations to create worklists and work-stealing to balance the shared-memory workload. This work is motivated by mesh adaptation algorithms, for which we show a parallel efficiency of 60% and 50% on Intel(R) Xeon(R) Sandy Bridge and AMD Opteron(tm) Magny-Cours systems, respectively, using these techniques.
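
The two ingredients named in this abstract, processing in rounds of independent sets and committing changes through deferred updates, can be sketched on a toy graph. The code below is a serial illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-vertex task is a simple neighbour average, every vertex is assumed to have at least one neighbour, and the atomic fetch-and-add worklists and work-stealing from the paper are not modelled.

```python
def independent_set(worklist, adjacency):
    """Greedily pick vertices from the worklist so that no two are neighbours."""
    chosen, blocked = [], set()
    for v in worklist:
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked.update(adjacency[v])
    return chosen


def smooth_in_rounds(values, adjacency):
    """Replace each vertex value by its neighbours' mean, one independent set per round.

    Returns the final values and the list of batches that were processed.
    """
    values = dict(values)
    worklist = sorted(adjacency)
    rounds = []
    while worklist:
        batch = independent_set(worklist, adjacency)
        # Compute phase: tasks in a batch only read shared data, so this loop
        # is the part that could be spread across threads. Results are staged.
        deferred = [
            (v, sum(values[n] for n in adjacency[v]) / len(adjacency[v]))
            for v in batch
        ]
        # Commit phase: apply the staged updates once the whole batch is done.
        for v, new_value in deferred:
            values[v] = new_value
        picked = set(batch)
        worklist = [v for v in worklist if v not in picked]
        rounds.append(batch)
    return values, rounds
```

On the path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` the vertices are processed in two rounds, `[0, 2]` and then `[1, 3]`, so no two neighbouring vertices are ever updated in the same round.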

preprint, 2013, arXiv

A thread-parallel algorithm for anisotropic mesh adaptation

Anisotropic mesh adaptation is a powerful way to directly minimise the computational cost of mesh-based simulation. It is particularly important for multi-scale problems where the required number of floating-point operations can be reduced by orders of magnitude relative to more traditional static mesh approaches. Increasingly, finite element and finite volume codes are being optimised for modern multi-core architectures. Typically, decomposition methods implemented through the Message Passing Interface (MPI) are applied for inter-node parallelisation, while a threaded programming model, such as OpenMP, is used for intra-node parallelisation. Inter-node parallelism for mesh adaptivity has been successfully implemented by a number of groups. However, thread-level parallelism is significantly more challenging because the underlying data structures are extensively modified during mesh adaptation and a greater degree of parallelism must be realised. In this paper we describe a new thread-parallel algorithm for anisotropic mesh adaptation. For each of the mesh optimisation phases (refinement, coarsening, swapping and smoothing) we describe how independent sets of tasks are defined. We show how a deferred updates strategy can be used to update the mesh data structures in parallel and without data contention. We show that despite the complex nature of mesh adaptation and inherent load imbalances in the mesh adaptivity, a parallel efficiency of 60% is achieved on an 8 core Intel Xeon Sandy Bridge, and a 40% parallel efficiency is achieved using 16 cores in a 2 socket Intel Xeon Sandy Bridge ccNUMA system.
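
To put the efficiency figures in perspective, the short sketch below converts them into speedups. It assumes the usual definition of parallel efficiency as speedup divided by core count; the abstract does not state which definition it uses, and the function name is hypothetical.

```python
def speedup_from_efficiency(efficiency, cores):
    """Speedup implied by a parallel efficiency, assuming efficiency = speedup / cores."""
    return efficiency * cores
```

Under that assumption, 60% on 8 cores corresponds to a speedup of about 4.8x over one core, and 40% on 16 cores to about 6.4x, so the 16-core run would still be faster in absolute terms despite the lower efficiency.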
