Researcher profile

Aimin Hao

Aimin Hao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2022arXiv

Weakly Supervised Visual-Auditory Fixation Prediction with Multigranularity Perception

Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly. However, deep learning-based visualaudio fixation prediction is still in its infancy. At present, only a few visual-audio sequences have been furnished, with real fixations being recorded in real visual-audio environments. Hence, it would be neither efficient nor necessary to recollect real fixations under the same visual-audio circumstances. To address this problem, this paper promotes a novel approach in a weakly supervised manner to alleviate the demand of large-scale training sets for visual-audio model training. By using only the video category tags, we propose the selective class activation mapping (SCAM) and its upgrade (SCAM+). In the spatial-temporal-audio circumstance, the former follows a coarse-to-fine strategy to select the most discriminative regions, and these regions are usually capable of exhibiting high consistency with the real human-eye fixations. The latter equips the SCAM with an additional multi-granularity perception mechanism, making the whole process more consistent with that of the real human visual system. Moreover, we distill knowledge from these regions to obtain complete new spatial-temporal-audio (STA) fixation prediction (FP) networks, enabling broad applications in cases where video tags are not available. Without resorting to any real human-eye fixation, the performances of these STA FP networks are comparable to those of fully supervised networks. The code and results are publicly available at https://github.com/guotaowang/STANet.

preprint2020arXiv

A Deeper Look at Salient Object Detection: Bi-stream Network with a Small Training Dataset

Compared with the conventional hand-crafted approaches, the deep learning based methods have achieved tremendous performance improvements by training exquisitely crafted fancy networks over large-scale training sets. However, do we really need large-scale training set for salient object detection (SOD)? In this paper, we provide a deeper insight into the interrelationship between the SOD performances and the training sets. To alleviate the conventional demands for large-scale training data, we provide a feasible way to construct a novel small-scale training set, which only contains 4K images. Moreover, we propose a novel bi-stream network to take full advantage of our proposed small training set, which is consisted of two feature backbones with different structures, achieving complementary semantical saliency fusion via the proposed gate control unit. To our best knowledge, this is the first attempt to use a small-scale training set to outperform state-of-the-art models which are trained on large-scale training sets; nevertheless, our method can still achieve the leading state-of-the-art performance on five benchmark datasets.

preprint2020arXiv

A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video Data

With the rapid development of deep learning techniques, image saliency deep models trained solely by spatial information have occasionally achieved detection performance for video data comparable to that of the models trained by both spatial and temporal information. However, due to the lesser consideration of temporal information, the image saliency deep models may become fragile in the video sequences dominated by temporal information. Thus, the most recent video saliency detection approaches have adopted the network architecture starting with a spatial deep model that is followed by an elaborately designed temporal deep model. However, such methods easily encounter the performance bottleneck arising from the single stream learning methodology, so the overall detection performance is largely determined by the spatial deep model. In sharp contrast to the current mainstream methods, this paper proposes a novel plug-and-play scheme to weakly retrain a pretrained image saliency deep model for video data by using the newly sensed and coded temporal information. Thus, the retrained image saliency deep model will be able to maintain temporal saliency awareness, achieving much improved detection performance. Moreover, our method is simple yet effective for adapting any off-the-shelf pre-trained image saliency deep model to obtain high-quality video saliency detection. Additionally, both the data and source code of our method are publicly available.

preprint2020arXiv

Knowing Depth Quality In Advance: A Depth Quality Assessment Method For RGB-D Salient Object Detection

Previous RGB-D salient object detection (SOD) methods have widely adopted deep learning tools to automatically strike a trade-off between RGB and D (depth), whose key rationale is to take full advantage of their complementary nature, aiming for a much-improved SOD performance than that of using either of them solely. However, such fully automatic fusions may not always be helpful for the SOD task because the D quality itself usually varies from scene to scene. It may easily lead to a suboptimal fusion result if the D quality is not considered beforehand. Moreover, as an objective factor, the D quality has long been overlooked by previous work. As a result, it is becoming a clear performance bottleneck. Thus, we propose a simple yet effective scheme to measure D quality in advance, the key idea of which is to devise a series of features in accordance with the common attributes of high-quality D regions. To be more concrete, we conduct D quality assessments for each image region, following a multi-scale methodology that includes low-level edge consistency, mid-level regional uncertainty and high-level model variance. All these components will be computed independently and then be assembled with RGB and D features, applied as implicit indicators, to guide the selective fusion. Compared with the state-of-the-art fusion schemes, our method can achieve a more reasonable fusion status between RGB and D. Specifically, the proposed D quality measurement method achieves steady performance improvements for almost 2.0\% in general.

preprint2020arXiv

Recursive Multi-model Complementary Deep Fusion forRobust Salient Object Detection via Parallel Sub Networks

Fully convolutional networks have shown outstanding performance in the salient object detection (SOD) field. The state-of-the-art (SOTA) methods have a tendency to become deeper and more complex, which easily homogenize their learned deep features, resulting in a clear performance bottleneck. In sharp contrast to the conventional ``deeper'' schemes, this paper proposes a ``wider'' network architecture which consists of parallel sub networks with totally different network architectures. In this way, those deep features obtained via these two sub networks will exhibit large diversity, which will have large potential to be able to complement with each other. However, a large diversity may easily lead to the feature conflictions, thus we use the dense short-connections to enable a recursively interaction between the parallel sub networks, pursuing an optimal complementary status between multi-model deep features. Finally, all these complementary multi-model deep features will be selectively fused to make high-performance salient object detections. Extensive experiments on several famous benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of the proposed wider framework.

preprint2020arXiv

Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter

The real human attention is an interactive activity between our visual system and our brain, using both low-level visual stimulus and high-level semantic information. Previous image salient object detection (SOD) works conduct their saliency predictions in a multi-task manner, i.e., performing pixel-wise saliency regression and segmentation-like saliency refinement at the same time, which degenerates their feature backbones in revealing semantic information. However, given an image, we tend to pay more attention to those regions which are semantically salient even in the case that these regions are perceptually not the most salient ones at first glance. In this paper, we divide the SOD problem into two sequential tasks: 1) we propose a lightweight, weakly supervised deep network to coarsely locate those semantically salient regions first; 2) then, as a post-processing procedure, we selectively fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement. In sharp contrast to the state-of-the-art (SOTA) methods that focus on learning pixel-wise saliency in "single image" using perceptual clues mainly, our method has investigated the "object-level semantic ranks between multiple images", of which the methodology is more consistent with the real human attention mechanism. Our method is simple yet effective, which is the first attempt to consider the salient object detection mainly as an object-level semantic re-ranking problem.