Source author record

Minghai Qin

Minghai Qin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Machine Learning Neural and Evolutionary Computing Information Theory math.IT Artificial Intelligence Distributed, Parallel, and Cluster Computing eess.IV Emerging Technologies Hardware Architecture

Catalog footprint

What is connected

11works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Adaptive Read Thresholds for NAND Flash

A primary source of increased read time on NAND flash comes from the fact that in the presence of noise, the flash medium must be read several times using different read threshold voltages for the decoder to succeed. This paper proposes an algorithm that uses a limited number of re-reads to characterize the noise distribution and recover the stored information. Both hard and soft decoding are considered. For hard decoding, the paper attempts to find a read threshold minimizing bit-error-rate (BER) and derives an expression for the resulting codeword-error-rate. For soft decoding, it shows that minimizing BER and minimizing codeword-error-rate are competing objectives in the presence of a limited number of allowed re-reads, and proposes a trade-off between the two. The proposed method does not require any prior knowledge about the noise distribution, but can take advantage of such information when it is available. Each read threshold is chosen based on the results of previous reads, following an optimal policy derived through a dynamic programming backward recursion. The method and results are studied from the perspective of an SLC Flash memory with Gaussian noise for each level but the paper explains how the method could be extended to other scenarios.

preprint2022arXiv

CHEX: CHannel EXploration for CNN Model Compression

Channel pruning has been broadly recognized as an effective technique to reduce the computation and memory cost of deep convolutional neural networks. However, conventional pruning methods have limitations in that: they are restricted to pruning process only, and they require a fully pre-trained large model. Such limitations may lead to sub-optimal model quality as well as excessive memory and training cost. In this paper, we propose a novel Channel Exploration methodology, dubbed as CHEX, to rectify these problems. As opposed to pruning-only strategy, we propose to repeatedly prune and regrow the channels throughout the training process, which reduces the risk of pruning important channels prematurely. More exactly: From intra-layer's aspect, we tackle the channel pruning problem via a well known column subset selection (CSS) formulation. From inter-layer's aspect, our regrowing stages open a path for dynamically re-allocating the number of channels across all the layers under a global channel sparsity constraint. In addition, all the exploration process is done in a single training from scratch without the need of a pre-trained large model. Experimental results demonstrate that CHEX can effectively reduce the FLOPs of diverse CNN architectures on a variety of computer vision tasks, including image classification, object detection, instance segmentation, and 3D vision. For example, our compressed ResNet-50 model on ImageNet dataset achieves 76% top1 accuracy with only 25% FLOPs of the original ResNet-50 model, outperforming previous state-of-the-art channel pruning methods. The checkpoints and code are available at here .

preprint2022arXiv

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

Deep learning-based super-resolution (SR) has gained tremendous popularity in recent years because of its high image quality performance and wide application scenarios. However, prior methods typically suffer from large amounts of computations and huge power consumption, causing difficulties for real-time inference, especially on resource-limited platforms such as mobile devices. To mitigate this, we propose a compiler-aware SR neural architecture search (NAS) framework that conducts depth search and per-layer width search with adaptive SR blocks. The inference speed is directly taken into the optimization along with the SR loss to derive SR models with high image quality while satisfying the real-time inference requirement. Instead of measuring the speed on mobile devices at each iteration during the search process, a speed model incorporated with compiler optimizations is leveraged to predict the inference latency of the SR block with various width configurations for faster convergence. With the proposed framework, we achieve real-time SR inference for implementing 720p resolution with competitive SR performance (in terms of PSNR and SSIM) on GPU/DSP of mobile platforms (Samsung Galaxy S21).

preprint2022arXiv

Effective Model Sparsification by Scheduled Grow-and-Prune Methods

Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology without having to pre-train a dense model. It addresses the shortcomings of the previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that the models pruned using the proposed methods match or beat the quality of the highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, objective detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA results by 1.5%. Code available at: https://github.com/boone891214/GaP.

preprint2022arXiv

Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introduces negligible overheads using our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques can achieve the state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81, 4.18 and 1.90 times on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.

preprint2020arXiv

A Unified DNN Weight Compression Framework Using Reweighted Optimization Methods

To address the large model size and intensive computation requirement of deep neural networks (DNNs), weight pruning techniques have been proposed and generally fall into two categories, i.e., static regularization-based pruning and dynamic regularization-based pruning. However, the former method currently suffers either complex workloads or accuracy degradation, while the latter one takes a long time to tune the parameters to achieve the desired pruning rate without accuracy loss. In this paper, we propose a unified DNN weight pruning framework with dynamically updated regularization terms bounded by the designated constraint, which can generate both non-structured sparsity and different kinds of structured sparsity. We also extend our method to an integrated framework for the combination of different DNN compression tasks.

preprint2020arXiv

Computation on Sparse Neural Networks: an Inspiration for Future Hardware

Neural network models are widely used in solving many challenging problems, such as computer vision, personalized recommendation, and natural language processing. Those models are very computationally intensive and reach the hardware limit of the existing server and IoT devices. Thus, finding better model architectures with much less amount of computation while maximally preserving the accuracy is a popular research topic. Among various mechanisms that aim to reduce the computation complexity, identifying the zero values in the model weights and in the activations to avoid computing them is a promising direction. In this paper, we summarize the current status of the research on the computation of sparse neural networks, from the perspective of the sparse algorithms, the software frameworks, and the hardware accelerations. We observe that the search for the sparse structure can be a general methodology for high-quality model explorations, in addition to a strategy for high-efficiency model execution. We discuss the model accuracy influenced by the number of weight parameters and the structure of the model. The corresponding models are called to be located in the weight dominated and structure dominated regions, respectively. We show that for practically complicated problems, it is more beneficial to search large and sparse models in the weight dominated region. In order to achieve the goal, new approaches are required to search for proper sparse structures, and new sparse training hardware needs to be developed to facilitate fast iterations of sparse models.

preprint2020arXiv

Learning in the Frequency Domain

Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, it removes both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of the well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting the frequency-domain information as the input. Experiment results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach and meanwhile further reduce the input data size. Specifically for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.

preprint2020arXiv

Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks

In cloud and edge computing models, it is important that compute devices at the edge be as power efficient as possible. Long short-term memory (LSTM) neural networks have been widely used for natural language processing, time series prediction and many other sequential data tasks. Thus, for these applications there is increasing need for low-power accelerators for LSTM model inference at the edge. In order to reduce power dissipation due to data transfers within inference devices, there has been significant interest in accelerating vector-matrix multiplication (VMM) operations using non-volatile memory (NVM) weight arrays. In NVM array-based hardware, reduced bit-widths also significantly increases the power efficiency. In this paper, we focus on the application of quantization-aware training algorithm to LSTM models, and the benefits these models bring in terms of resilience against both quantization error and analog device noise. We have shown that only 4-bit NVM weights and 4-bit ADC/DACs are needed to produce equivalent LSTM network performance as floating-point baseline. Reasonable levels of ADC quantization noise and weight noise can be naturally tolerated within our NVMbased quantized LSTM network. Benchmark analysis of our proposed LSTM accelerator for inference has shown at least 2.4x better computing efficiency and 40x higher area efficiency than traditional digital approaches (GPU, FPGA, and ASIC). Some other novel approaches based on NVM promise to deliver higher computing efficiency (up to 4.7x) but require larger arrays with potential higher error rates.

preprint2016arXiv

Joint Source-Channel Decoding of Polar Codes for Language-Based Source

We exploit the redundancy of the language-based source to help polar decoding. By judging the validity of decoded words in the decoded sequence with the help of a dictionary, the polar list decoder constantly detects erroneous paths after every few bits are decoded. This path-pruning technique based on joint decoding has advantages over stand-alone polar list decoding in that most decoding errors in early stages are corrected. In order to facilitate the joint decoding, we first propose a construction of dynamic dictionary using a trie and show an efficient way to trace the dictionary during decoding. Then we propose a joint decoding scheme of polar codes taking into account both information from the channel and the source. The proposed scheme has the same decoding complexity as the list decoding of polar codes. A list-size adaptive joint decoding is further implemented to largely reduce the decoding complexity. We conclude by simulation that the joint decoding schemes outperform stand-alone polar codes with CRC-aided successive cancellation list decoding by over 0.6 dB.

preprint2012arXiv

Time-Space Constrained Codes for Phase-Change Memories

Phase-change memory (PCM) is a promising non-volatile solid-state memory technology. A PCM cell stores data by using its amorphous and crystalline states. The cell changes between these two states using high temperature. However, since the cells are sensitive to high temperature, it is important, when programming cells, to balance the heat both in time and space. In this paper, we study the time-space constraint for PCM, which was originally proposed by Jiang et al. A code is called an \emph{$(α,β,p)$-constrained code} if for any $α$ consecutive rewrites and for any segment of $β$ contiguous cells, the total rewrite cost of the $β$ cells over those $α$ rewrites is at most $p$. Here, the cells are binary and the rewrite cost is defined to be the Hamming distance between the current and next memory states. First, we show a general upper bound on the achievable rate of these codes which extends the results of Jiang et al. Then, we generalize their construction for $(α\geq 1, β=1,p=1)$-constrained codes and show another construction for $(α= 1, β\geq 1,p\geq1)$-constrained codes. Finally, we show that these two constructions can be used to construct codes for all values of $α$, $β$, and $p$.

Minghai Qin

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Adaptive Read Thresholds for NAND Flash

CHEX: CHannel EXploration for CNN Model Compression

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

Effective Model Sparsification by Scheduled Grow-and-Prune Methods

Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

A Unified DNN Weight Compression Framework Using Reweighted Optimization Methods

Computation on Sparse Neural Networks: an Inspiration for Future Hardware

Learning in the Frequency Domain

Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks

Joint Source-Channel Decoding of Polar Codes for Language-Based Source

Time-Space Constrained Codes for Phase-Change Memories