Source author record

Simon J. Hollis

Simon J. Hollis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Hardware Architecture Programming Languages Data Structures and Algorithms

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2015arXiv

Overview of Swallow --- A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems

We present Swallow, a scalable many-core architecture, with a current configuration of 480 x 32-bit processors. Swallow is an open-source architecture, designed from the ground up to deliver scalable increases in usable computational power to allow experimentation with many-core applications and the operating systems that support them. Scalability is enabled by the creation of a tile-able system with a low-latency interconnect, featuring an attractive communication-to-computation ratio and the use of a distributed memory configuration. We analyse the energy and computational and communication performances of Swallow. The system provides 240GIPS with each core consuming 71--193mW, dependent on workload. Power consumption per instruction is lower than almost all systems of comparable scale. We also show how the use of a distributed operating system (nOS) allows the easy creation of scalable software to exploit Swallow's potential. Finally, we show two use case studies: modelling neurons and the overlay of shared memory on a distributed memory system.

preprint2014arXiv

A high-level model of embedded flash energy consumption

The alignment of code in the flash memory of deeply embedded SoCs can have a large impact on the total energy consumption of a computation. We investigate the effect of code alignment in six SoCs and find that a large proportion of this energy (up to 15% of total SoC energy consumption) can be saved by changes to the alignment. A flexible model is created to predict the read-access energy consumption of flash memory on deeply embedded SoCs, where code is executed in place. This model uses the instruction level memory accesses performed by the processor to calculate the flash energy consumption of a sequence of instructions. We derive the model parameters for five SoCs and validate them. The error is as low as 5%, with a 11% average normalized RMS deviation overall. The scope for using this model to optimize code alignment is explored across a range of benchmarks and SoCs. Analysis shows that over 30% of loops can be better aligned. This can significantly reduce energy while increasing code size by less than 4%. We conclude that this effect has potential as an effective optimization, saving significant energy in deeply embedded SoCs.

preprint2012arXiv

Scalable data abstractions for distributed parallel computations

The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software and a key abstraction this provides is to separate the representation of data from the computation. Many current parallel programming models use a shared memory model to provide data abstraction but this doesn't scale well with large numbers of cores due to non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data and it provides the programmer with the flexibility to employ shared or distributed styles of data-parallelism where applicable. It is capable of an efficient implementation, and with the provision of a small set of primitive capabilities in the hardware, it can be compiled to operate directly on the hardware, in the same way stack-based allocation operates for subroutines in sequential machines.

preprint2011arXiv

Fast Distributed Process Creation with the XMOS XS1 Architecture

The provision of mechanisms for processor allocation in current distributed parallel programming models is very limited. This makes difficult, or even prohibits, the expression of a large class of programs which require a run-time assessment of their required resources. This includes programs whose structure is irregular, composite or unbounded. Efficient allocation of processors requires a process creation mechanism able to initiate and terminate remote computations quickly. This paper presents the design, demonstration and analysis of an explicit mechanism to do this, implemented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that process creation can be made efficient so that it incurs only a fractional overhead of the total runtime and that it can be combined naturally with recursion to enable rapid distribution of computations over a system.