Source author record

Mingyu Wang

Mingyu Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


eess.SP; eess.IV; Applications; Artificial Intelligence; Computational Engineering, Finance, and Science; Robotics

Catalog footprint

What is connected

4 works

6 topics

4 close collaborators


Preprint, 2020, arXiv

Directional Primitives for Uncertainty-Aware Motion Estimation in Urban Environments

We can use driving data collected over a long period of time to extract rich information about how vehicles behave in different areas of the road. In this paper, we introduce the concept of directional primitives, a representation of prior information about road networks. Specifically, we represent the uncertainty of directions using a mixture of von Mises distributions and the associated speeds using gamma distributions. These location-dependent primitives can be combined with motion information of surrounding vehicles to predict their future behavior in the form of probability distributions. Experiments conducted on highways, intersections, and roundabouts in the CARLA simulator, as well as on real-world urban driving datasets, indicate that primitives lead to better uncertainty-aware motion estimation.
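The representation described in this abstract can be illustrated with a minimal sketch: a mixture of von Mises densities over heading angle and a gamma density over speed. The function names, the series-based Bessel evaluation, and all parameter values are assumptions chosen for illustration; they are not taken from the paper, and the paper's fitting procedure is not reproduced here.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function I0(x), evaluated by its power series."""
    total, term = 1.0, 1.0
    q = (x / 2.0) ** 2
    for k in range(1, terms):
        term *= q / (k * k)  # term_k = (x^2 / 4)^k / (k!)^2
        total += term
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of one von Mises component with mean direction mu (radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def direction_mixture_pdf(theta, weights, mus, kappas):
    """Mixture of von Mises densities; weights are assumed to sum to 1."""
    return sum(w * von_mises_pdf(theta, m, k)
               for w, m, k in zip(weights, mus, kappas))


def gamma_pdf(speed, shape, scale):
    """Gamma density over speed (shape/scale parameterization)."""
    if speed <= 0.0:
        return 0.0
    return (speed ** (shape - 1.0) * math.exp(-speed / scale)
            / (math.gamma(shape) * scale ** shape))


# Hypothetical primitive for one road cell: most traffic heads along
# 0 rad, some turns toward pi/2 rad; speeds cluster around shape * scale.
weights, mus, kappas = [0.7, 0.3], [0.0, math.pi / 2], [8.0, 4.0]
print(direction_mixture_pdf(0.1, weights, mus, kappas))
print(gamma_pdf(10.0, shape=5.0, scale=2.0))
```

A larger concentration parameter kappa gives a narrower peak around the mean direction, which is how such a primitive would express low directional uncertainty on, for example, a straight highway segment.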

Preprint, 2020, arXiv

Low Precision Floating-point Arithmetic for High Performance FPGA-based CNN Acceleration

Low precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs, and (2) needing 16-bit floating-point or 8-bit fixed-point for good accuracy. In this paper, we propose a low precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder (MAC) and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP slice of the Xilinx Kintex-7 family (KC705 in this paper), while one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput by 64.5x over an Intel i9 CPU and by 1.5x over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5x and 27.5x and improve average throughput per DSP by 4.1x and 5x, respectively. To the best of our knowledge, this is the first in-depth study to simplify one multiplication for CNN inference to one 4-bit MAC and implement four multiplications within one DSP while maintaining comparable accuracy without any re-training.
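To make the idea of 8-bit floating-point quantization concrete, the sketch below rounds a value to the nearest number representable with 1 sign bit, `exp_bits` exponent bits, and `man_bits` mantissa bits. The 1-4-3 default split, the bias, the absence of reserved inf/NaN codes, and the function name `quantize_minifloat` are all assumptions made for illustration; the abstract does not state the bit allocation LPFP selects, and this is not the paper's search procedure.

```python
import math


def quantize_minifloat(x, exp_bits=4, man_bits=3):
    """Round x to the nearest value of a small sign/exponent/mantissa format.

    Assumed format (hypothetical, not from the paper): IEEE-style bias,
    subnormals below the smallest normal exponent, no codes reserved for
    inf/NaN, and saturation at the largest finite value.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                   # smallest normal exponent
    e_max = 2 ** exp_bits - 1 - bias   # largest exponent
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) = ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, e_min)             # values below 2**e_min share one step size
    step = 2.0 ** (e - man_bits)       # spacing of representable values near a
    q = round(a / step) * step         # round to nearest, ties to even
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, max_val)


# Quantize a few hypothetical weights and show the rounding error.
for w in [0.02, -0.3, 1.07, 5.5]:
    q = quantize_minifloat(w)
    print(w, q, abs(q - w))
```

Because the step size scales with the magnitude of the value, the relative error stays bounded across the normal range, which is the usual argument for floating-point over fixed-point formats when weights span several orders of magnitude.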

Preprint, 2020, arXiv

Phoenix: A Low-Precision Floating-Point Quantization Oriented Architecture for Convolutional Neural Networks

Convolutional neural networks (CNNs) achieve state-of-the-art performance at the cost of becoming deeper and larger. Although quantization (both fixed-point and floating-point) has proven effective for reducing storage and memory access, two challenges -- 1) accuracy loss caused by quantization without calibration, fine-tuning or re-training for deep CNNs and 2) hardware inefficiency caused by floating-point quantization -- prevent processors from completely leveraging the benefits. In this paper, we propose a low-precision floating-point quantization oriented processor, named Phoenix, to address the above challenges. We make three key observations: 1) 8-bit floating-point quantization incurs less error than 8-bit fixed-point quantization; 2) without using any calibration, fine-tuning or re-training techniques, normalization before quantization further reduces accuracy degradation; 3) an 8-bit floating-point multiplier achieves higher hardware efficiency than an 8-bit fixed-point multiplier if the full-precision product is applied. Based on these key observations, we propose a normalization-oriented 8-bit floating-point quantization method to reduce storage and memory access with negligible accuracy loss (within 0.5%/0.3% for top-1/top-5 accuracy, respectively). We further design a hardware processor to address the hardware inefficiency caused by the floating-point multiplier. Compared with a state-of-the-art accelerator, Phoenix is 3.32x and 7.45x better in performance with the same core area for AlexNet and VGG16, respectively.

Preprint, 2020, arXiv

VoxCap: FFT-Accelerated and Tucker-Enhanced Capacitance Extraction Simulator for Voxelized Structures

VoxCap, a fast Fourier transform (FFT)-accelerated and Tucker-enhanced integral equation simulator for capacitance extraction of voxelized structures, is proposed. VoxCap solves the surface integral equations (SIEs) for conductor and dielectric surfaces with three key attributes that make it highly CPU and memory efficient for the capacitance extraction of voxelized structures: (i) VoxCap exploits FFTs to accelerate the matrix-vector multiplications during the iterative solution of the linear system of equations arising from the discretization of the SIEs. (ii) During the iterative solution, VoxCap uses a highly effective and memory-efficient preconditioner that reduces the number of iterations significantly. (iii) VoxCap employs Tucker decompositions to compress the block Toeplitz and circulant tensors, which require the most memory in the simulator. By doing so, it reduces the memory requirement of these tensors from hundreds of gigabytes to a few megabytes and the CPU time required to obtain the Toeplitz tensors from tens of minutes (even hours) to a few seconds for very large scale problems. VoxCap is capable of accurately computing the capacitance of arbitrarily shaped and large-scale voxelized structures on a desktop computer.
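The FFT acceleration mentioned in point (i) rests on a standard fact: a Toeplitz matrix can be embedded in a larger circulant matrix, and a circulant matrix-vector product is a circular convolution that FFTs evaluate in O(n log n). The sketch below is a one-dimensional, pure-Python analogue written for illustration only. VoxCap itself works with block Toeplitz tensors on a voxel grid, and the function names here are hypothetical rather than taken from the simulator.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unscaled inverse)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def toeplitz_matvec_fft(first_col, first_row, x):
    """Compute T @ x for the Toeplitz matrix T defined by its first column and row."""
    n = len(x)
    size = 1
    while size < 2 * n:
        size *= 2
    # First column of the circulant embedding: the Toeplitz column, zero
    # padding, then the Toeplitz row (without its first entry) reversed.
    c = list(first_col) + [0.0] * (size - 2 * n + 1) + list(reversed(first_row[1:]))
    xp = list(x) + [0.0] * (size - n)
    prod = [p * q for p, q in zip(fft(c), fft(xp))]
    y = fft(prod, invert=True)
    return [v.real / size for v in y[:n]]


def toeplitz_matvec_direct(first_col, first_row, x):
    """O(n^2) reference: T[i][j] = first_col[i-j] if i >= j else first_row[j-i]."""
    n = len(x)
    return [sum((first_col[i - j] if i >= j else first_row[j - i]) * x[j]
                for j in range(n)) for i in range(n)]


col, row, vec = [1.0, 2.0, 3.0], [1.0, 4.0, 5.0], [1.0, 0.0, -1.0]
print(toeplitz_matvec_fft(col, row, vec))
print(toeplitz_matvec_direct(col, row, vec))
```

Only the first column and first row are stored, so memory grows linearly with n instead of quadratically. In an iterative solver, the transform of `c` would normally be computed once and reused for every matrix-vector product.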
