Source author record

Ruochen Hao

Ruochen Hao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Distributed, Parallel, and Cluster Computing; eess.SY; math.OC; Systems and Control

Catalog footprint

What is connected

3 works

4 topics

4 close collaborators

preprint · 2022 · arXiv

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

General Matrix Multiplication (GEMM) has a wide range of applications in scientific simulation and artificial intelligence. Although traditional libraries achieve high performance on large, regular-shaped GEMMs, they often perform poorly on irregular-shaped GEMMs, which are common in new algorithms and applications of high-performance computing (HPC). Due to energy efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. Targeting the multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC, an efficient implementation, ftIMM, is proposed for three types of irregular-shaped GEMMs. ftIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. Experiments show that ftIMM outperforms traditional GEMM implementations on the multi-core DSPs in FT-m7032, with up to 7.2x performance improvement on irregular-shaped GEMMs. ftIMM on the multi-core DSPs also far outperforms the open-source library on the multi-core CPUs in FT-m7032, delivering up to 3.1x higher efficiency.
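The abstract mentions block sizes as a tunable parameter. As a rough illustration of what loop blocking means for an irregular (for example, tall-and-skinny) GEMM, here is a minimal pure-Python sketch. It is not ftIMM's implementation: the function name `blocked_gemm` and the parameters `bm`, `bn`, `bk` are hypothetical, and a real DSP kernel would use generated assembly micro-kernels rather than scalar loops.

```python
def blocked_gemm(a, b, bm=4, bn=4, bk=4):
    """Compute C = A @ B with loop blocking.

    a is an M x K matrix and b is a K x N matrix, both non-empty lists of lists.
    bm, bn, bk are hypothetical block sizes along M, N and K.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Outer loops walk over blocks; min() clips a block at the matrix edge,
    # so shapes that are not multiples of the block size still work.
    for i0 in range(0, m, bm):
        for p0 in range(0, k, bk):
            for j0 in range(0, n, bn):
                for i in range(i0, min(i0 + bm, m)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + bk, k)):
                        a_ip = row_a[p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + bn, n)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

An auto-tuner in this spirit would time `blocked_gemm` over a small set of candidate `(bm, bn, bk)` values for a given shape and keep the fastest; the candidate set and timing method are left open here.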

preprint · 2022 · arXiv

Towards Effective Depthwise Convolutions on ARMv8 Architecture

Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, whose performance is bounded mainly by arithmetic operations, depthwise convolutions are bounded mainly by memory access, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which incur additional memory accesses. However, the existing direct implementations of depthwise convolutions on ARMv8 architectures strike a poor trade-off in register-level reuse across different tensors, which usually leads to sub-optimal performance. In this paper, we propose new direct implementations of depthwise convolutions by means of implicit padding, register tiling, etc., covering the forward propagation, backward propagation, and weight gradient update procedures. Compared to the existing ones, our new implementations incur much less communication overhead between registers and cache. Experimental results on two ARMv8 CPUs show that our implementations deliver, on average, 4.88x and 16.4x performance improvement over the existing direct ones in open-source libraries and the matrix multiplication-based ones in PyTorch, respectively.
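To make "direct algorithm" and "implicit padding" concrete, the following is a minimal pure-Python sketch of a stride-1 depthwise convolution forward pass. It is an illustration under stated assumptions, not the paper's ARMv8 implementation: the function name `depthwise_conv2d`, the `[C][H][W]` layout, and the `pad` parameter are hypothetical, and register tiling is not modeled.

```python
def depthwise_conv2d(x, w, pad=1):
    """Direct depthwise convolution, stride 1.

    x: input of shape [C][H][W]; w: one filter per channel, shape [C][KH][KW].
    Returns the output of shape [C][H_out][W_out].
    """
    channels = len(x)
    h, wd = len(x[0]), len(x[0][0])
    kh, kw = len(w[0]), len(w[0][0])
    h_out = h + 2 * pad - kh + 1
    w_out = wd + 2 * pad - kw + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(channels)]
    for c in range(channels):  # each channel uses only its own filter
        for oy in range(h_out):
            for ox in range(w_out):
                acc = 0.0
                for ky in range(kh):
                    iy = oy + ky - pad
                    if iy < 0 or iy >= h:
                        continue  # implicit zero padding: skip rows outside the input
                    for kx in range(kw):
                        ix = ox + kx - pad
                        if 0 <= ix < wd:  # implicit zero padding for columns
                            acc += x[c][iy][ix] * w[c][ky][kx]
                out[c][oy][ox] = acc
    return out
```

The bounds checks stand in for a padded copy of the input: no zero-filled buffer is ever materialized, which avoids the extra memory traffic that explicit padding would add.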

preprint · 2020 · arXiv

Managing connected and automated vehicles with flexible routing at "lane-allocation-free" intersections

Trajectory planning and coordination for connected and automated vehicles (CAVs) have been studied in the literature at isolated "signal-free" intersections and in "signal-free" corridors in a fully CAV environment. Most of the existing studies are based on the definition of approaching and exit lanes. The route a vehicle takes to pass through an intersection is determined by its movement. That is, only the origin and destination arms are included. This study proposes a mixed-integer linear programming (MILP) model to optimize vehicle trajectories at an isolated "signal-free" intersection without lane allocation, which is denoted as "lane-allocation-free" (LAF) control. Each lane can be used as both an approaching and an exit lane for all vehicle movements, including left-turn, through, and right-turn. A vehicle can take a flexible route by way of multiple arms to pass through the intersection. In this way, the spatial-temporal resources are expected to be fully utilized. The interactions between vehicle trajectories are modeled explicitly at the microscopic level. Vehicle routes and trajectories (i.e., car-following and lane-changing behaviors) at the intersection are optimized in one unified framework for system optimality in terms of total vehicle delay. To account for varying traffic conditions, the planning horizon is adaptively adjusted in the implementation procedure of the proposed model to strike a balance between solution feasibility and computational burden. Numerical studies validate the advantages of the proposed LAF control in terms of both vehicle delay and throughput under different demand structures and temporal safety gaps.