Source author record

Yu Mao

Yu Mao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Information Theory, Machine Learning, math.IT, math.OC, Biological Physics, Computer Vision, cond-mat.mtrl-sci, math.AP, math.NA, physics.comp-ph, physics.optics

Catalog footprint

What is connected

7 works

11 topics

4 close collaborators

preprint, 2022, arXiv

A Fast Transformer-based General-Purpose Lossless Compressor

Deep-learning-based compressors have received interest recently due to their much improved compression ratios. However, modern approaches suffer from long execution times. To ease this problem, this paper targets cutting down the execution time of deep-learning-based compressors. Building history dependencies sequentially (e.g., with recurrent neural networks) is responsible for long inference latency. Instead, we introduce the transformer into deep-learning compressors to build history dependencies in parallel. However, existing transformers are too computationally heavy and ill-suited to compression tasks. This paper proposes a fast general-purpose lossless compressor, TRACE, by designing a compression-friendly structure based on a single-layer transformer. We first design a new metric to advise the selection of parts of the compression model structure. Byte-grouping and Shared-ffn schemes are further proposed to fully utilize the capacity of the single-layer transformer. These features allow TRACE to achieve a competitive compression ratio at a much faster speed. In addition, we further accelerate the compression procedure by designing a controller to reduce the parameter-updating overhead. Experiments show that TRACE achieves an overall $\sim$3x speedup while keeping a compression ratio comparable to state-of-the-art compressors. The source code for TRACE and links to the datasets are available at https://github.com/mynotwo/A-Fast-Transformer-based-General-Purpose-LosslessCompressor.

preprint, 2022, arXiv

Variational Nested Dropout

Nested dropout is a variant of dropout operation that is able to order network parameters or features based on the pre-defined importance during training. It has been explored for: I. Constructing nested nets: the nested nets are neural networks whose architectures can be adjusted instantly during testing time, e.g., based on computational constraints. The nested dropout implicitly ranks the network parameters, generating a set of sub-networks such that any smaller sub-network forms the basis of a larger one. II. Learning ordered representation: the nested dropout applied to the latent representation of a generative model (e.g., auto-encoder) ranks the features, enforcing explicit order of the dense representation over dimensions. However, the dropout rate is fixed as a hyper-parameter during the whole training process. For nested nets, when network parameters are removed, the performance decays in a human-specified trajectory rather than in a trajectory learned from data. For generative models, the importance of features is specified as a constant vector, restraining the flexibility of representation learning. To address the problem, we focus on the probabilistic counterpart of the nested dropout. We propose a variational nested dropout (VND) operation that draws samples of multi-dimensional ordered masks at a low cost, providing useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the parameter distributions. We further exploit the VND under different generative models for learning ordered latent distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related generative models on data generation tasks.
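As a rough illustration of the fixed-rate nested dropout that the abstract takes as its starting point (not the proposed VND operation), the sketch below samples an ordered mask: the first k units are kept and every later unit is dropped. It assumes the cut-off index follows a truncated geometric distribution; the names `sample_nested_mask`, `apply_mask`, and the parameter `p` are hypothetical and do not come from the paper.

```python
import random


def sample_nested_mask(dim, p, rng):
    """Return an ordered 0/1 mask: ones for the first k units, zeros after.

    Assumption: k follows a geometric distribution with stop probability p,
    truncated to the range 1..dim. Here p is a fixed hyper-parameter.
    """
    k = 1
    while k < dim and rng.random() > p:
        k += 1
    return [1] * k + [0] * (dim - k)


def apply_mask(features, mask):
    """Zero out the features that the mask drops."""
    return [f * m for f, m in zip(features, mask)]


rng = random.Random(0)
features = [0.9, -0.4, 0.3, 0.1, -0.05]
mask = sample_nested_mask(len(features), p=0.3, rng=rng)
print(mask, apply_mask(features, mask))
```

Because a unit can only survive when every earlier unit survives, earlier units are trained more often, which is what induces the ordering. In this sketch `p` stays constant throughout, which corresponds to the fixed dropout rate the abstract identifies as a limitation.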

preprint, 2022, arXiv

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Weight decay is often used to ensure good generalization in the training practice of deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this paper, we demonstrate that the practical usage of weight decay still has some unsolved problems in spite of existing theoretical work on explaining the effect of weight decay in BN-DNNs. On the one hand, when a non-adaptive learning rate optimizer, e.g. SGD with momentum (SGDM), is used, the effective learning rate continues to increase even after the initial training stage, which leads to an overfitting effect in many neural architectures. On the other hand, in both SGDM and adaptive learning rate optimizers, e.g. Adam, the effect of weight decay on generalization is quite sensitive to the hyperparameter. Thus, finding an optimal weight decay parameter requires extensive parameter searching. To address those weaknesses, we propose to regularize the weight norm using a simple yet effective weight rescaling (WRS) scheme as an alternative to weight decay. WRS controls the weight norm by explicitly rescaling it to the unit norm, which both prevents a large increase in the gradient and ensures a sufficiently large effective learning rate to improve generalization. On a variety of computer vision applications including image classification, object detection, semantic segmentation and crowd counting, we show the effectiveness and robustness of WRS compared with weight decay, implicit weight rescaling (weight standardization) and gradient projection (AdamP).
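The core operation named in the abstract, rescaling a weight vector to unit norm, can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how often the rescaling is applied or to which layers, so applying it after every plain SGD step is a choice made here, and the function names are hypothetical.

```python
import math


def rescale_to_unit_norm(weights, eps=1e-12):
    """Divide a weight vector by its L2 norm so the result has norm 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / max(norm, eps) for w in weights]


def sgd_step_with_rescaling(weights, grads, lr):
    """One plain SGD update followed by rescaling (assumed schedule)."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return rescale_to_unit_norm(updated)


weights = rescale_to_unit_norm([3.0, 4.0])
weights = sgd_step_with_rescaling(weights, grads=[0.1, -0.2], lr=0.5)
print(weights, math.sqrt(sum(w * w for w in weights)))
```

For a layer followed by batch normalization, scaling the weights by a positive constant leaves the normalized output essentially unchanged, so the rescaling alters the weight norm without changing the function the layer computes.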

preprint, 2012, arXiv

Reconstruction of Binary Functions and Shapes from Incomplete Frequency Information

The characterization of a binary function by partial frequency information is considered. We show that it is possible to reconstruct binary signals from incomplete frequency measurements via the solution of a simple linear optimization problem. We further prove that if a binary function is spatially structured (e.g. a general black-white image or an indicator function of a shape), then it can be recovered from very few low frequency measurements in general. These results would lead to efficient methods of sensing, characterizing and recovering a binary signal or a shape as well as other applications like deconvolution of binary functions blurred by a low-pass filter. Numerical results are provided to demonstrate the theoretical arguments.

preprint, 2011, arXiv

A nonlinear PDE-based method for sparse deconvolution

In this paper, we introduce a new nonlinear evolution partial differential equation for sparse deconvolution problems. The proposed PDE has the form of a continuity equation that arises in various research areas, e.g. fluid dynamics and optimal transportation, and thus has some interesting physical and geometric interpretations. The underlying optimization model that we consider is the standard $\ell_1$ minimization with linear equality constraints, i.e. $\min_u\{\|u\|_1 : Au=f\}$ with $A$ being an under-sampled convolution operator. We show that our PDE preserves the $\ell_1$ norm while lowering the residual $\|Au-f\|_2$. More importantly, the solution of the PDE becomes sparser asymptotically, which is illustrated numerically. Therefore, it can be treated as a natural and helpful plug-in to some algorithms for $\ell_1$ minimization problems, e.g. Bregman iterative methods introduced for sparse reconstruction problems in [W. Yin, S. Osher, D. Goldfarb, and J. Darbon, SIAM J. Imaging Sci., 1 (2008), pp. 143-168]. Numerical experiments show great improvements in terms of both convergence speed and reconstruction quality.

preprint, 2011, arXiv

Fast Linearized Bregman Iteration for Compressive Sensing and Sparse Denoising

We propose and analyze an extremely fast, efficient, and simple method for solving the problem: $\min\{\|u\|_1 : Au = f,\ u \in \mathbb{R}^n\}$. This method was first described in [J. Darbon and S. Osher, preprint, 2007], with more details in [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences, 1(1), 143-168, 2008] and rigorous theory given in [J. Cai, S. Osher and Z. Shen, Math. Comp., to appear, 2008, see also UCLA CAM Report 08-06] and [J. Cai, S. Osher and Z. Shen, UCLA CAM Report, 08-52, 2008]. The motivation was compressive sensing, which now has a vast and exciting history, which seems to have started with Candes et al. [E. Candes, J. Romberg and T. Tao, 52(2), 489-509, 2006] and Donoho [D. L. Donoho, IEEE Trans. Inform. Theory, 52, 1289-1306, 2006]. See [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences 1(1), 143-168, 2008] and [J. Cai, S. Osher and Z. Shen, Math. Comp., to appear, 2008, see also UCLA CAM Report, 08-06] and [J. Cai, S. Osher and Z. Shen, UCLA CAM Report, 08-52, 2008] for a large set of references. Our method introduces an improvement called "kicking" of the very efficient method of [J. Darbon and S. Osher, preprint, 2007] and [W. Yin, S. Osher, D. Goldfarb and J. Darbon, SIAM J. Imaging Sciences, 1(1), 143-168, 2008] and also applies it to the problem of denoising of undersampled signals. The use of Bregman iteration for denoising of images began in [S. Osher, M. Burger, D. Goldfarb, J. Xu and W. Yin, Multiscale Model. Simul, 4(2), 460-489, 2005] and led to improved results for total variation based methods. Here we apply it to denoise signals, especially essentially sparse signals, which might even be undersampled.
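For orientation, here is a minimal sketch of the basic linearized Bregman iteration in its commonly stated form, $v^{k+1} = v^k + A^T(f - Au^k)$ followed by $u^{k+1} = \delta \cdot \mathrm{shrink}(v^{k+1}, \mu)$. It does not implement the "kicking" improvement the abstract proposes. The parameter names `mu` and `delta`, the function names, and the tiny test problem are assumptions chosen for illustration.

```python
def shrink(x, mu):
    """Soft-thresholding: move x toward zero by mu, clamping at zero."""
    if x > mu:
        return x - mu
    if x < -mu:
        return x + mu
    return 0.0


def linearized_bregman(A, f, mu, delta, iterations):
    """Basic linearized Bregman iteration for min ||u||_1 subject to Au = f.

    A is a list of rows, f a list of measurements. No kicking is applied.
    """
    m, n = len(A), len(A[0])
    u = [0.0] * n
    v = [0.0] * n
    for _ in range(iterations):
        # Residual f - A u at the current iterate.
        residual = [f[i] - sum(A[i][j] * u[j] for j in range(n))
                    for i in range(m)]
        for j in range(n):
            # Accumulate A^T residual, then threshold and scale.
            v[j] += sum(A[i][j] * residual[i] for i in range(m))
            u[j] = delta * shrink(v[j], mu)
    return u


# Hypothetical toy problem: one measurement, two unknowns, u1 + 2*u2 = 2.
u = linearized_bregman(A=[[1.0, 2.0]], f=[2.0], mu=10.0, delta=0.2,
                       iterations=200)
print(u)
```

On this toy problem the iterates approach (0, 1), the minimum $\ell_1$-norm solution of $u_1 + 2u_2 = 2$. For the first few iterations `u` stays at zero while `v` accumulates.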

preprint, 2011, arXiv

Potential and Challenge of Ankylography

The concept of ankylography, which under certain circumstances enables 3D structure determination from a single view [1], had ignited a lively debate even before its publication [2,3]. Since then, a number of readers have requested the ankylographic reconstruction codes from us. To facilitate a better understanding of ankylography, we posted the source codes of the ankylographic reconstruction on a public website and encouraged interested readers to download the codes and test the method [4]. Those who have tested our codes confirm that the principle of ankylography works. Furthermore, our mathematical analysis and numerical simulations suggest that, for a continuous object with array size of 14x14x14 voxels, its 3D structure can usually be reconstructed from the diffraction intensities sampled on a spherical shell 1 voxel thick [4]. In some cases where the object does not have very dense structure, ankylography can be applied to reconstruct its 3D image with array size of 25x25x25 voxels [4]. What remains to be elucidated is how to extend ankylography to the reconstruction of larger objects, and what further theoretical, experimental and algorithm developments will be necessary to make ankylography a practical and useful imaging tool. Here we present our up-to-date understanding of the potential and challenge of ankylography. Further, we clarify some misconceptions on ankylography, and respond to technical comments raised by Wei [5] and Wang et al. [6]. Finally, it is worthwhile to point out that the potential for recovering 3D information from the Fourier coefficients within a spherical shell may also find application in other fields.
