Source author record

James Hanlon

James Hanlon appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Hardware Architecture Programming Languages

Catalog footprint

What is connected

4works

3topics

3close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

The Graphcore Intelligence Processing Unit contains an original pseudorandom number generator (PRNG) called xoroshiro128aox, based on the F2-linear generator xoroshiro128. It is designed to be cheap to implement in hardware and provide high-quality statistical randomness. In this paper, we present a rigorous assessment of the generator's quality using standard statistical test suites and compare the results with the fast contemporary PRNGs xoroshiro128+, pcg64 and philox4x32-10. We show that xoroshiro128aox mitigates the known weakness in the lower order bits of xoroshiro128+ with a new 'AOX' output function by passing the BigCrush and PractRand suites, but we note that the function has some minor non uniformities. We focus our testing with specific tests for linear artefacts to highlight the weaknesses of both xoroshiro128 PRNGs, but conclude that they are hard to detect, and xoroshiro128aox otherwise provides a good trade off between statistical quality and hardware implementation cost.

preprint2015arXiv

Emulating a large memory with a collection of small ones

Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many computational problems exhibit little or no parallelism and many existing formulations are sequential. It is therefore essential that highly-parallel architectures can support sequential computation by emulating large memories with collections of smaller ones, thus supporting efficient execution of sequential programs or sequential components of parallel programs. This paper demonstrates that a realistic parallel architecture with scalable low-latency communications can execute large-memory sequential programs with a factor of only 2 to 3 slowdown, when compared to a conventional sequential architecture. This overhead seems an acceptable price to pay to be able to switch between executing highly-parallel programs and sequential programs with large memory requirements. Efficient emulation of large memories could therefore facilitate a transition from sequential machines by allowing existing programs to be compiled directly to a highly-parallel architecture and then for their performance to be improved by exploiting parallelism in memory accesses and computation.

preprint2012arXiv

Scalable data abstractions for distributed parallel computations

The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software and a key abstraction this provides is to separate the representation of data from the computation. Many current parallel programming models use a shared memory model to provide data abstraction but this doesn't scale well with large numbers of cores due to non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data and it provides the programmer with the flexibility to employ shared or distributed styles of data-parallelism where applicable. It is capable of an efficient implementation, and with the provision of a small set of primitive capabilities in the hardware, it can be compiled to operate directly on the hardware, in the same way stack-based allocation operates for subroutines in sequential machines.

preprint2011arXiv

Fast Distributed Process Creation with the XMOS XS1 Architecture

The provision of mechanisms for processor allocation in current distributed parallel programming models is very limited. This makes difficult, or even prohibits, the expression of a large class of programs which require a run-time assessment of their required resources. This includes programs whose structure is irregular, composite or unbounded. Efficient allocation of processors requires a process creation mechanism able to initiate and terminate remote computations quickly. This paper presents the design, demonstration and analysis of an explicit mechanism to do this, implemented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that process creation can be made efficient so that it incurs only a fractional overhead of the total runtime and that it can be combined naturally with recursion to enable rapid distribution of computations over a system.