Researcher profile

Weisi Lin

Weisi Lin contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
20works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

20 published item(s)

preprint2026arXiv

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution ($3\times28\times28$) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing project gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General $\ell_{\infty}$-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD$^{2}$-GSM. Extensive experiments on the Kodak $(3\times768\times512)$ demonstrate that our PGD$^{2}$-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2-GSM.

preprint2024arXiv

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.

preprint2024arXiv

Q-Refine: A Perceptual Quality Refiner for AI-Generated Image

With the rapid evolution of the Text-to-Image (T2I) model in recent years, their unsatisfactory generation result has become a challenge. However, uniformly refining AI-Generated Images (AIGIs) of different qualities not only limited optimization capabilities for low-quality AIGIs but also brought negative optimization to high-quality AIGIs. To address this issue, a quality-award refiner named Q-Refine is proposed. Based on the preference of the Human Visual System (HVS), Q-Refine uses the Image Quality Assessment (IQA) metric to guide the refining process for the first time, and modify images of different qualities through three adaptive pipelines. Experimental shows that for mainstream T2I models, Q-Refine can perform effective optimization to AIGIs of different qualities. It can be a general refiner to optimize AIGIs from both fidelity and aesthetic quality levels, thus expanding the application of the T2I generation models.

preprint2022arXiv

Adjacent Context Coordination Network for Salient Object Detection in Optical Remote Sensing Images

Salient object detection (SOD) in optical remote sensing images (RSIs), or RSI-SOD, is an emerging topic in understanding optical RSIs. However, due to the difference between optical RSIs and natural scene images (NSIs), directly applying NSI-SOD methods to optical RSIs fails to achieve satisfactory results. In this paper, we propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for RSI-SOD. Specifically, ACCoNet consists of three parts: an encoder, Adjacent Context Coordination Modules (ACCoMs), and a decoder. As the key component of ACCoNet, ACCoM activates the salient regions of output features of the encoder and transmits them to the decoder. ACCoM contains a local branch and two adjacent branches to coordinate the multi-level features simultaneously. The local branch highlights the salient regions in an adaptive way, while the adjacent branches introduce global information of adjacent levels to enhance salient regions. Additionally, to extend the capabilities of the classic decoder block (i.e., several cascaded convolutional layers), we extend it with two bifurcations and propose a Bifurcation-Aggregation Block to capture the contextual information in the decoder. Extensive experiments on two benchmark datasets demonstrate that the proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/ACCoNet.

preprint2022arXiv

Auto-Weighted Layer Representation Based View Synthesis Distortion Estimation for 3-D Video Coding

Recently, various view synthesis distortion estimation models have been studied to better serve for 3-D video coding. However, they can hardly model the relationship quantitatively among different levels of depth changes, texture degeneration, and the view synthesis distortion (VSD), which is crucial for rate-distortion optimization and rate allocation. In this paper, an auto-weighted layer representation based view synthesis distortion estimation model is developed. Firstly, the sub-VSD (S-VSD) is defined according to the level of depth changes and their associated texture degeneration. After that, a set of theoretical derivations demonstrate that the VSD can be approximately decomposed into the S-VSDs multiplied by their associated weights. To obtain the S-VSDs, a layer-based representation of S-VSD is developed, where all the pixels with the same level of depth changes are represented with a layer to enable efficient S-VSD calculation at the layer level. Meanwhile, a nonlinear mapping function is learnt to accurately represent the relationship between the VSD and S-VSDs, automatically providing weights for S-VSDs during the VSD estimation. To learn such function, a dataset of VSD and its associated S-VSDs are built. Experimental results show that the VSD can be accurately estimated with the weights learnt by the nonlinear mapping function once its associated S-VSDs are available. The proposed method outperforms the relevant state-of-the-art methods in both accuracy and efficiency. The dataset and source code of the proposed method will be available at https://github.com/jianjin008/.

preprint2022arXiv

Automatic Facial Skin Feature Detection for Everyone

Automatic assessment and understanding of facial skin condition have several applications, including the early detection of underlying health problems, lifestyle and dietary treatment, skin-care product recommendation, etc. Selfies in the wild serve as an excellent data resource to democratize skin quality assessment, but suffer from several data collection challenges.The key to guaranteeing an accurate assessment is accurate detection of different skin features. We present an automatic facial skin feature detection method that works across a variety of skin tones and age groups for selfies in the wild. To be specific, we annotate the locations of acne, pigmentation, and wrinkle for selfie images with different skin tone colors, severity levels, and lighting conditions. The annotation is conducted in a two-phase scheme with the help of a dermatologist to train volunteers for annotation. We employ Unet++ as the network architecture for feature detection. This work shows that the two-phase annotation scheme can robustly detect the accurate locations of acne, pigmentation, and wrinkle for selfie images with different ethnicities, skin tone colors, severity levels, age groups, and lighting conditions.

preprint2022arXiv

Blind Quality Assessment of 3D Dense Point Clouds with Structure Guided Resampling

Objective quality assessment of 3D point clouds is essential for the development of immersive multimedia systems in real-world applications. Despite the success of perceptual quality evaluation for 2D images and videos, blind/no-reference metrics are still scarce for 3D point clouds with large-scale irregularly distributed 3D points. Therefore, in this paper, we propose an objective point cloud quality index with Structure Guided Resampling (SGR) to automatically evaluate the perceptually visual quality of 3D dense point clouds. The proposed SGR is a general-purpose blind quality assessment method without the assistance of any reference information. Specifically, considering that the human visual system (HVS) is highly sensitive to structure information, we first exploit the unique normal vectors of point clouds to execute regional pre-processing which consists of keypoint resampling and local region construction. Then, we extract three groups of quality-related features, including: 1) geometry density features; 2) color naturalness features; 3) angular consistency features. Both the cognitive peculiarities of the human brain and naturalness regularity are involved in the designed quality-aware features that can capture the most vital aspects of distorted 3D point clouds. Extensive experiments on several publicly available subjective point cloud quality databases validate that our proposed SGR can compete with state-of-the-art full-reference, reduced-reference, and no-reference quality assessment algorithms.

preprint2022arXiv

DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment

The temporal relationships between frames and their influences on video quality assessment (VQA) are still under-studied in existing works. These relationships lead to two important types of effects for video quality. Firstly, some temporal variations (such as shaking, flicker, and abrupt scene transitions) are causing temporal distortions and lead to extra quality degradations, while other variations (e.g. those related to meaningful happenings) do not. Secondly, the human visual system often has different attention to frames with different contents, resulting in their different importance to the overall video quality. Based on prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues. To better differentiate temporal variations and thus capture the temporal distortions, we design a transformer-based Spatial-Temporal Distortion Extraction (STDE) module. To tackle with temporal quality attention, we propose the encoder-decoder-like temporal content transformer (TCT). We also introduce the temporal sampling on features to reduce the input length for the TCT, so as to improve the learning effectiveness and efficiency of this module. Consisting of the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets and up to 10% better generalization ability than existing methods. We also conduct extensive ablation experiments to prove the effectiveness of each part in our proposed model, and provide visualizations to prove that the proposed modules achieve our intention on modeling these temporal issues. We will publish our codes and pretrained weights later.

preprint2022arXiv

Distilling Knowledge from Object Classification to Aesthetics Assessment

In this work, we point out that the major dilemma of image aesthetics assessment (IAA) comes from the abstract nature of aesthetic labels. That is, a vast variety of distinct contents can correspond to the same aesthetic label. On the one hand, during inference, the IAA model is required to relate various distinct contents to the same aesthetic label. On the other hand, when training, it would be hard for the IAA model to learn to distinguish different contents merely with the supervision from aesthetic labels, since aesthetic labels are not directly related to any specific content. To deal with this dilemma, we propose to distill knowledge on semantic patterns for a vast variety of image contents from multiple pre-trained object classification (POC) models to an IAA model. Expecting the combination of multiple POC models can provide sufficient knowledge on various image contents, the IAA model can easier learn to relate various distinct contents to a limited number of aesthetic labels. By supervising an end-to-end single-backbone IAA model with the distilled knowledge, the performance of the IAA model is significantly improved by 4.8% in SRCC compared to the version trained only with ground-truth aesthetic labels. On specific categories of images, the SRCC improvement brought by the proposed method can achieve up to 7.2%. Peer comparison also shows that our method outperforms 10 previous IAA methods.

preprint2022arXiv

Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality Assessment

With the rapid growth of in-the-wild videos taken by non-specialists, blind video quality assessment (VQA) has become a challenging and demanding problem. Although lots of efforts have been made to solve this problem, it remains unclear how the human visual system (HVS) relates to the temporal quality of videos. Meanwhile, recent work has found that the frames of natural video transformed into the perceptual domain of the HVS tend to form a straight trajectory of the representations. With the obtained insight that distortion impairs the perceived video quality and results in a curved trajectory of the perceptual representation, we propose a temporal perceptual quality index (TPQI) to measure the temporal distortion by describing the graphic morphology of the representation. Specifically, we first extract the video perceptual representations from the lateral geniculate nucleus (LGN) and primary visual area (V1) of the HVS, and then measure the straightness and compactness of their trajectories to quantify the degradation in naturalness and content continuity of video. Experiments show that the perceptual representation in the HVS is an effective way of predicting subjective temporal quality, and thus TPQI can, for the first time, achieve comparable performance to the spatial quality metric and be even more effective in assessing videos with large temporal variations. We further demonstrate that by combining with NIQE, a spatial quality metric, TPQI can achieve top performance over popular in-the-wild video datasets. More importantly, TPQI does not require any additional information beyond the video being evaluated and thus can be applied to any datasets without parameter tuning. Source code is available at https://github.com/UoLMM/TPQI-VQA.

preprint2022arXiv

FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling

Current deep video quality assessment (VQA) methods are usually with high computational costs when evaluating high-resolution videos. This cost hinders them from learning better video-quality-related representations via end-to-end training. Existing approaches typically consider naive sampling to reduce the computational cost, such as resizing and cropping. However, they obviously corrupt quality-related information in videos and are thus not optimal for learning good representations for VQA. Therefore, there is an eager need to design a new quality-retained sampling scheme for VQA. In this paper, we propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution and covers global quality with contextual relations via mini-patches sampled in uniform grids. These mini-patches are spliced and aligned temporally, named as fragments. We further build the Fragment Attention Network (FANet) specially designed to accommodate fragments as inputs. Consisting of fragments and FANet, the proposed FrAgment Sample Transformer for VQA (FAST-VQA) enables efficient end-to-end deep VQA and learns effective video-quality-related representations. It improves state-of-the-art accuracy by around 10% while reducing 99.5% FLOPs on 1080P high-resolution videos. The newly learned video-quality-related representations can also be transferred into smaller VQA datasets, boosting performance in these scenarios. Extensive experiments show that FAST-VQA has good performance on inputs of various resolutions while retaining high efficiency. We publish our code at https://github.com/timothyhtimothy/FAST-VQA.

preprint2022arXiv

Full RGB Just Noticeable Difference (JND) Modelling

Just Noticeable Difference (JND) has many applications in multimedia signal processing, especially for visual data processing up to date. It's generally defined as the minimum visual content changes that the human can perspective, which has been studied for decades. However, most of the existing methods only focus on the luminance component of JND modelling and simply regard chrominance components as scaled versions of luminance. In this paper, we propose a JND model to generate the JND by taking the characteristics of full RGB channels into account, termed as the RGB-JND. To this end, an RGB-JND-NET is proposed, where the visual content in full RGB channels is used to extract features for JND generation. To supervise the JND generation, an adaptive image quality assessment combination (AIC) is developed. Besides, the RDB-JND-NET also takes the visual attention into account by automatically mining the underlying relationship between visual attention and the JND, which is further used to constrain the JND spatial distribution. To the best of our knowledge, this is the first work on careful investigation of JND modelling for full-color space. Experimental results demonstrate that the RGB-JND-NET model outperforms the relevant state-of-the-art JND models. Besides, the JND of the red and blue channels are larger than that of the green one according to the experimental results of the proposed model, which demonstrates that more changes can be tolerated in the red and blue channels, in line with the well-known fact that the human visual system is more sensitive to the green channel in comparison with the red and blue ones.

preprint2022arXiv

HVS-Inspired Signal Degradation Network for Just Noticeable Difference Estimation

Significant improvement has been made on just noticeable difference (JND) modelling due to the development of deep neural networks, especially for the recently developed unsupervised-JND generation models. However, they have a major drawback that the generated JND is assessed in the real-world signal domain instead of in the perceptual domain in the human brain. There is an obvious difference when JND is assessed in such two domains since the visual signal in the real world is encoded before it is delivered into the brain with the human visual system (HVS). Hence, we propose an HVS-inspired signal degradation network for JND estimation. To achieve this, we carefully analyze the HVS perceptual process in JND subjective viewing to obtain relevant insights, and then design an HVS-inspired signal degradation (HVS-SD) network to represent the signal degradation in the HVS. On the one hand, the well learnt HVS-SD enables us to assess the JND in the perceptual domain. On the other hand, it provides more accurate prior information for better guiding JND generation. Additionally, considering the requirement that reasonable JND should not lead to visual attention shifting, a visual attention loss is proposed to control JND generation. Experimental results demonstrate that the proposed method achieves the SOTA performance for accurately estimating the redundancy of the HVS. Source code will be available at https://github.com/jianjin008/HVS-SD-JND.

preprint2022arXiv

Just Noticeable Difference for Deep Machine Vision

As an important perceptual characteristic of the Human Visual System (HVS), the Just Noticeable Difference (JND) has been studied for decades with image and video processing (e.g., perceptual visual signal compression). However, there is little exploration on the existence of JND for the Deep Machine Vision (DMV), although the DMV has made great strides in many machine vision tasks. In this paper, we take an initial attempt, and demonstrate that the DMV has the JND, termed as the DMV-JND. We then propose a JND model for the image classification task in the DMV. It has been discovered that the DMV can tolerate distorted images with average PSNR of only 9.56dB (the lower the better), by generating JND via unsupervised learning with the proposed DMV-JND-NET. In particular, a semantic-guided redundancy assessment strategy is designed to restrain the magnitude and spatial distribution of the DMV-JND. Experimental results on image classification demonstrate that we successfully find the JND for deep machine vision. Our DMV-JND facilitates a possible direction for DMV-oriented image and video compression, watermarking, quality assessment, deep neural network security, and so on.

preprint2022arXiv

Lightweight Salient Object Detection in Optical Remote Sensing Images via Feature Correlation

Salient object detection in optical remote sensing images (ORSI-SOD) has been widely explored for understanding ORSIs. However, previous methods focus mainly on improving the detection accuracy while neglecting the cost in memory and computation, which may hinder their real-world applications. In this paper, we propose a novel lightweight ORSI-SOD solution, named CorrNet, to address these issues. In CorrNet, we first lighten the backbone (VGG-16) and build a lightweight subnet for feature extraction. Then, following the coarse-to-fine strategy, we generate an initial coarse saliency map from high-level semantic features in a Correlation Module (CorrM). The coarse saliency map serves as the location guidance for low-level features. In CorrM, we mine the object location information between high-level semantic features through the cross-layer correlation operation. Finally, based on low-level detailed features, we refine the coarse saliency map in the refinement subnet equipped with Dense Lightweight Refinement Blocks, and produce the final fine saliency map. By reducing the parameters and computations of each component, CorrNet ends up having only 4.09M parameters and running with 21.09G FLOPs. Experimental results on two public datasets demonstrate that our lightweight CorrNet achieves competitive or even better performance compared with 26 state-of-the-art methods (including 16 large CNN-based methods and 2 lightweight methods), and meanwhile enjoys the clear memory and run time efficiency. The code and results of our method are available at https://github.com/MathLee/CorrNet.

preprint2022arXiv

Minimum Noticeable Difference based Adversarial Privacy Preserving Image Generation

Deep learning models are found to be vulnerable to adversarial examples, as wrong predictions can be caused by small perturbation in input for deep learning models. Most of the existing works of adversarial image generation try to achieve attacks for most models, while few of them make efforts on guaranteeing the perceptual quality of the adversarial examples. High quality adversarial examples matter for many applications, especially for the privacy preserving. In this work, we develop a framework based on the Minimum Noticeable Difference (MND) concept to generate adversarial privacy preserving images that have minimum perceptual difference from the clean ones but are able to attack deep learning models. To achieve this, an adversarial loss is firstly proposed to make the deep learning models attacked by the adversarial images successfully. Then, a perceptual quality-preserving loss is developed by taking the magnitude of perturbation and perturbation-caused structural and gradient changes into account, which aims to preserve high perceptual quality for adversarial image generation. To the best of our knowledge, this is the first work on exploring quality-preserving adversarial image generation based on the MND concept for privacy preserving. To evaluate its performance in terms of perceptual quality, the deep models on image classification and face recognition are tested with the proposed method and several anchor methods in this work. Extensive experimental results demonstrate that the proposed MND framework is capable of generating adversarial images with remarkably improved performance metrics (e.g., PSNR, SSIM, and MOS) than that generated with the anchor methods.

preprint2022arXiv

SelfReformer: Self-Refined Network with Transformer for Salient Object Detection

The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD). Unfortunately, existing methods still struggle to generate complete predictions with fine details. There are two major problems in conventional approaches: first, for global context, high-level CNN-based encoder features cannot effectively catch long-range dependencies, resulting in incomplete predictions. Second, downsampling the ground truth to fit the size of predictions will introduce inaccuracy as the ground truth details are lost during interpolation or pooling. Thus, in this work, we developed a Transformer-based network and framed a supervised task for a branch to learn the global context information explicitly. Besides, we adopt Pixel Shuffle from Super-Resolution (SR) to reshape the predictions back to the size of ground truth instead of the reverse. Thus details in the ground truth are untouched. In addition, we developed a two-stage Context Refinement Module (CRM) to fuse global context and automatically locate and refine the local details in the predictions. The proposed network can guide and correct itself based on the global and local context generated, thus is named, Self-Refined Transformer (SelfReformer). Extensive experiments and evaluation results on five benchmark datasets demonstrate the outstanding performance of the network, and we achieved the state-of-the-art.

preprint2022arXiv

Towards Top-Down Just Noticeable Difference Estimation of Natural Images

Just noticeable difference (JND) of natural images refers to the maximum pixel intensity change magnitude that typical human visual system (HVS) cannot perceive. Existing efforts on JND estimation mainly dedicate to modeling the diverse masking effects in either/both spatial or/and frequency domains, and then fusing them into an overall JND estimate. In this work, we turn to a dramatically different way to address this problem with a top-down design philosophy. Instead of explicitly formulating and fusing different masking effects in a bottom-up way, the proposed JND estimation model dedicates to first predicting a critical perceptual lossless (CPL) counterpart of the original image and then calculating the difference map between the original image and the predicted CPL image as the JND map. We conduct subjective experiments to determine the critical points of 500 images and find that the distribution of cumulative normalized KLT coefficient energy values over all 500 images at these critical points can be well characterized by a Weibull distribution. Given a testing image, its corresponding critical point is determined by a simple weighted average scheme where the weights are determined by a fitted Weibull distribution function. The performance of the proposed JND model is evaluated explicitly with direct JND prediction and implicitly with two applications including JND-guided noise injection and JND-guided image compression. Experimental results have demonstrated that our proposed JND model can achieve better performance than several latest JND models. In addition, we also compare the proposed JND model with existing visual difference predicator (VDP) metrics in terms of the capability in distortion detection and discrimination. The results indicate that our JND model also has a good performance in this task.

preprint2021arXiv

Collaborative Intelligence: Challenges and Opportunities

This paper presents an overview of the emerging area of collaborative intelligence (CI). Our goal is to raise awareness in the signal processing community of the challenges and opportunities in this area of growing importance, where key developments are expected to come from signal processing and related disciplines. The paper surveys the current state of the art in CI, with special emphasis on signal processing-related challenges in feature compression, error resilience, privacy, and system-level design.

preprint2020arXiv

GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Pixel Labeling

Existing CNN-based methods for pixel labeling heavily depend on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art pixel labeling neural networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatial-invariant and directly transit all feature information cross scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transit spatial-filtered features to another scale. Specifically, GSTO can work either with or without extra supervision. Unsupervised GSTO is learned from the feature itself while the supervised one is guided by the supervised probability matrix. Both forms of GSTO are lightweight and plug-and-play, which can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet) for pixel labeling, and it achieves new state-of-the-art results on the COCO benchmark for human pose estimation and other benchmarks for semantic segmentation including Cityscapes, LIP and Pascal Context, with negligible extra computational cost. Moreover, experiment results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules like PPM and ASPP. Code will be made available at https://github.com/VDIGPKU/GSTO.