Source author record

Wayne Luk

Wayne Luk appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Hardware Architecture Computational Engineering, Finance, and Science Machine Learning Databases eess.IV Performance Programming Languages q-fin.CP q-fin.GN

Catalog footprint

What is connected

10works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2025arXiv

Enthuse: Efficient Adaptable High-throughput Streaming Aggregation Engines

Aggregation queries are a series of computationally-demanding analytics operations on counted, grouped or time series data. They include tasks such as summation or finding the median among the items of the same group, and within a specified number of the last observed tuples for sliding window aggregation (SWAG). They have a wide range of applications including database analytics, operating systems, bank security and medical sensors. Existing challenges include the hardware complexity that comes with efficiently handling per-group states using hash-based approaches. This paper presents Enthuse, an adaptable pipeline for calculating a wide range of aggregation queries with high throughput. It is then adapted for SWAG and achieves up to 476x speedup over the CPU core of the same platform. It achieves unparalleled levels of performance and functionality such as a throughput of 1 GT/s on our setup for SWAG without groups, and more advanced operators with up to 4x the window sizes than the state-of-the-art with groups as an approximation for SWAG featuring per-group windows using a fraction of the resources and no DRAM.

preprint2022arXiv

FLiMS: a Fast Lightweight 2-way Merger for Sorting

In this paper, we present FLiMS, a highly-efficient and simple parallel algorithm for merging two sorted lists residing in banked and/or wide memory. On FPGAs, its implementation uses fewer hardware resources than the state-of-the-art alternatives, due to the reduced number of comparators and elimination of redundant logic found on prior attempts. In combination with the distributed nature of the selector stage, a higher performance is achieved for the same amount of parallelism or higher. This is useful in many applications such as in parallel merge trees to achieve high-throughput sorting, where the resource utilisation of the merger is critical for building large trees and internalising the workload for fast computation. Also presented are efficient variations of FLiMS for optimizing throughput for skewed datasets, achieving stable sorting or using fewer dequeue signals. Additionally, FLiMS is shown to perform well as conventional software on modern CPUs supporting single-instruction multiple-data (SIMD) instructions, surpassing the performance of some standard libraries for sorting.

preprint2022arXiv

Understanding intra-day price formation process by agent-based financial market simulation: calibrating the extended chiarella model

This article presents XGB-Chiarella, a powerful new approach for deploying agent-based models to generate realistic intra-day artificial financial price data. This approach is based on agent-based models, calibrated by XGBoost machine learning surrogate. Following the Extended Chiarella model, three types of trading agents are introduced in this agent-based model: fundamental traders, momentum traders, and noise traders. In particular, XGB-Chiarella focuses on configuring the simulation to accurately reflect real market behaviours. Instead of using the original Expectation-Maximisation algorithm for parameter estimation, the agent-based Extended Chiarella model is calibrated using XGBoost machine learning surrogate. It is shown that the machine learning surrogate learned in the proposed method is an accurate proxy of the true agent-based market simulation. The proposed calibration method is superior to the original Expectation-Maximisation parameter estimation in terms of the distance between historical and simulated stylised facts. With the same underlying model, the proposed methodology is capable of generating realistic price time series in various stocks listed at three different exchanges, which indicates the universality of intra-day price formation process. For the time scale (minutes) chosen in this paper, one agent per category is shown to be sufficient to capture the intra-day price formation process. The proposed XGB-Chiarella approach provides insights that the price formation process is comprised of the interactions between momentum traders, fundamental traders, and noise traders. It can also be used to enhance risk management by practitioners.

preprint2020arXiv

An Analysis of Alternating Direction Method of Multipliers for Feed-forward Neural Networks

In this work, we present a hardware compatible neural network training algorithm in which we used alternating direction method of multipliers (ADMM) and iterative least-square methods. The motive behind this approach was to conduct a method of training neural networks that is scalable and can be parallelised. These characteristics make this algorithm suitable for hardware implementation. We have achieved 6.9\% and 6.8\% better accuracy comparing to SGD and Adam respectively, with a four-layer neural network with hidden size of 28 on HIGGS dataset. Likewise, we could observe 21.0\% and 2.2\% accuracy improvement comparing to SGD and Adam respectively, on IRIS dataset with a three-layer neural network with hidden size of 8. This is while the use of matrix inversion, which is challenging for hardware implementation, is avoided in this method. We assessed the impact of avoiding matrix inversion on ADMM accuracy and we observed that we can safely replace matrix inversion with iterative least-square methods and maintain the desired performance. Also, the computational complexity of the implemented method is polynomial regarding dimensions of the input dataset and hidden size of the network.

preprint2020arXiv

An FPGA Accelerated Method for Training Feed-forward Neural Networks Using Alternating Direction Method of Multipliers and LSMR

In this project, we have successfully designed, implemented, deployed and tested a novel FPGA accelerated algorithm for neural network training. The algorithm itself was developed in an independent study option. This training method is based on Alternating Direction Method of Multipliers algorithm, which has strong parallel characteristics and avoids procedures such as matrix inversion that are problematic in hardware designs by employing LSMR. As an intermediate stage, we fully implemented the ADMM-LSMR method in C language for feed-forward neural networks with a flexible number of layers and hidden size. We demonstrated that the method can operate with fixed-point arithmetic without compromising the accuracy. Next, we devised an FPGA accelerated version of the algorithm using Intel FPGA SDK for OpenCL and performed extensive optimisation stages followed by successful deployment of the program on an Intel Arria 10 GX FPGA. The FPGA accelerated program showed up to 6 times speed up comparing to equivalent CPU implementation while achieving promising accuracy.

preprint2020arXiv

Improving Performance Estimation for FPGA-based Accelerators for Convolutional Neural Networks

Field-programmable gate array (FPGA) based accelerators are being widely used for acceleration of convolutional neural networks (CNNs) due to their potential in improving the performance and reconfigurability for specific application instances. To determine the optimal configuration of an FPGA-based accelerator, it is necessary to explore the design space and an accurate performance prediction plays an important role during the exploration. This work introduces a novel method for fast and accurate estimation of latency based on a Gaussian process parametrised by an analytic approximation and coupled with runtime data. The experiments conducted on three different CNNs on an FPGA-based accelerator on Intel Arria 10 GX 1150 demonstrated a 30.7% improvement in accuracy with respect to the mean absolute error in comparison to a standard analytic method in leave-one-out cross-validation.

preprint2016arXiv

A Domain Specific Approach to High Performance Heterogeneous Computing

Users of heterogeneous computing systems face two problems: firstly, in understanding the trade-off relationships between the observable characteristics of their applications, such as latency and quality of the result, and secondly, how to exploit knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics or domain metrics may be formulated in advance, and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program, which can be solved using heuristics, machine learning or Mixed Integer Linear Programming (MILP) frameworks. These claims are illustrated using the example domain of derivatives pricing in computational finance, with the domain metrics of workload latency or makespan and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running upon a diverse array of 16 Multicore CPUs, GPUs and FPGAs platforms, predictions made by models of both the makespan and accuracy are generally within 10% of the run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, a latency improvement of up to 24 and 270 times over the heuristic approach is seen.

preprint2015arXiv

Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service

In the near future FPGAs will be available by the hour, however this new Infrastructure as a Service (IaaS) usage mode presents both an opportunity and a challenge: The opportunity is that programmers can potentially trade resources for performance on a much larger scale, for much shorter periods of time than before. The challenge is in finding and traversing the trade-off for heterogeneous IaaS that guarantees increased resources result in the greatest possible increased performance. Such a trade-off is Pareto optimal. The Pareto optimal trade-off for clusters of heterogeneous resources can be found by solving multiple, multi-objective optimisation problems, resulting in an optimal allocation of tasks to the available platforms. Solving these optimisation programs can be done using simple heuristic approaches or formal Mixed Integer Linear Programming (MILP) techniques. When pricing 128 financial options using a Monte Carlo algorithm upon a heterogeneous cluster of Multicore CPU, GPU and FPGA platforms, the MILP approach produces a trade-off that is up to 110% faster than a heuristic approach, and over 50% cheaper. These results suggest that high quality performance-resource trade-offs of heterogeneous IaaS are best realised through a formal optimisation approach.

preprint2014arXiv

A Domain Specific Approach to Heterogeneous Computing: From Availability to Accessibility

We advocate a domain specific software development methodology for heterogeneous computing platforms such as Multicore CPUs, GPUs and FPGAs. We argue that three specific benefits are realised from adopting such an approach: portable, efficient implementations across heterogeneous platforms; domain specific metrics of quality that characterise platforms in a form software developers will understand; automatic, optimal partitioning across the available computing resources. These three benefits allow a development methodology for software developers where they describe their computational problems in a single, easy to understand form, and after a modeling procedure on the available resources, select how they would like to trade between various domain specific metrics. Our work on the Forward Financial Framework ($F^3$) demonstrates this methodology in practise. We are able to execute a range of computational finance option pricing tasks efficiently upon a wide range of CPU, GPU and FPGA computing platforms. We can also create accurate financial domain metric models of walltime latency and statistical confidence. Furthermore, we believe that we can support automatic, optimal partitioning using this execution and modelling capability.

preprint2007arXiv

Evaluation of SystemC Modelling of Reconfigurable Embedded Systems

This paper evaluates the use of pin and cycle accurate SystemC models for embedded system design exploration and early software development. The target system is MicroBlaze VanillaNet Platform running MicroBlaze uClinux operating system. The paper compares Register Transfer Level (RTL) Hardware Description Language (HDL) simulation speed to the simulation speed of several different SystemC models. It is shown that simulation speed of pin and cycle accurate models can go up to 150 kHz, compared to 100 Hz range of HDL simulation. Furthermore, utilising techniques that temporarily compromise cycle accuracy, effective simulation speed of up to 500 kHz can be obtained.

Wayne Luk

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Enthuse: Efficient Adaptable High-throughput Streaming Aggregation Engines

FLiMS: a Fast Lightweight 2-way Merger for Sorting

Understanding intra-day price formation process by agent-based financial market simulation: calibrating the extended chiarella model

An Analysis of Alternating Direction Method of Multipliers for Feed-forward Neural Networks

An FPGA Accelerated Method for Training Feed-forward Neural Networks Using Alternating Direction Method of Multipliers and LSMR

Improving Performance Estimation for FPGA-based Accelerators for Convolutional Neural Networks

A Domain Specific Approach to High Performance Heterogeneous Computing

Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service

A Domain Specific Approach to Heterogeneous Computing: From Availability to Accessibility

Evaluation of SystemC Modelling of Reconfigurable Embedded Systems