Source author record

Ali Borji

Ali Borji appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, Machine Learning, Artificial Intelligence, Cryptography and Security, eess.IV, Neural and Evolutionary Computing

Catalog footprint

What is connected

27 works

6 topics

4 close collaborators

preprint 2026 arXiv

Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection

Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.

preprint 2022 arXiv

A New Kind of Adversarial Example

Almost all adversarial attacks are formulated to add an imperceptible perturbation to an image in order to fool a model. Here, we consider the opposite which is adversarial examples that can fool a human but not a model. A large enough and perceptible perturbation is added to an image such that a model maintains its original decision, whereas a human will most likely make a mistake if forced to decide (or opt not to decide at all). Existing targeted attacks can be reformulated to synthesize such adversarial examples. Our proposed attack, dubbed NKE, is similar in essence to the fooling images, but is more efficient since it uses gradient descent instead of evolutionary algorithms. It also offers a new and unified perspective into the problem of adversarial vulnerability. Experimental results over MNIST and CIFAR-10 datasets show that our attack is quite efficient in fooling deep neural networks. Code is available at https://github.com/aliborji/NKE.

preprint 2022 arXiv

Complementary datasets to COCO for object detection

For nearly a decade, the COCO dataset has been the central test bed of research in object detection. According to recent benchmarks, however, performance on this dataset seems to have started to saturate. One possible reason is that it is not large enough for training deep models. To address this limitation, here we introduce two complementary datasets to COCO: i) COCO_OI, composed of images from COCO and OpenImages (from their 80 classes in common) with 1,418,978 training bounding boxes over 380,111 images, and 41,893 validation bounding boxes over 18,299 images, and ii) ObjectNet_D, containing objects in daily life situations (originally created for object recognition and known as ObjectNet; 29 categories in common with COCO). The latter can be used to test the generalization ability of object detectors. We evaluate some models on these datasets and pinpoint the source of errors. We encourage the community to utilize these datasets for training and testing object detection models. Code and data are available at https://github.com/aliborji/COCO_OI.

preprint 2022 arXiv

How good are deep models in understanding the generated images?

My goal in this paper is twofold: to study how well deep models can understand the images generated by DALL-E 2 and Midjourney, and to quantitatively evaluate these generative models. Two sets of generated images are collected for object recognition and visual question answering (VQA) tasks. On object recognition, the best model, out of 10 state-of-the-art object recognition models, achieves about 60% and 80% top-1 and top-5 accuracy, respectively. These numbers are much lower than the best accuracy on the ImageNet dataset (91% and 99%). On VQA, the OFA model scores 77.3% on answering 241 binary questions across 50 images. This model scores 94.7% on the binary VQA-v2 dataset. Humans are able to recognize the generated images and answer questions on them easily. We conclude that a) deep models struggle to understand the generated content, and may do better after fine-tuning, and b) there is a large distribution shift between the generated images and the real photographs. The distribution shift appears to be category-dependent. Data is available at: https://drive.google.com/file/d/1n2nCiaXtYJRRF2R73-LNE3zggeU_HeH0/view?usp=sharing.

preprint 2022 arXiv

Is current research on adversarial robustness addressing the right problem?

Short answer: Yes, Long answer: No! Indeed, research on adversarial robustness has led to invaluable insights helping us understand and explore different aspects of the problem. Many attacks and defenses have been proposed over the last couple of years. The problem, however, remains largely unsolved and poorly understood. Here, I argue that the current formulation of the problem serves short term goals, and needs to be revised for us to achieve bigger gains. Specifically, the bound on perturbation has created a somewhat contrived setting and needs to be relaxed. This has misled us to focus on model classes that are not expressive enough to begin with. Instead, inspired by human vision and the fact that we rely more on robust features such as shape, vertices, and foreground objects than non-robust features such as texture, efforts should be steered towards looking for significantly different classes of models. Maybe instead of narrowing down on imperceptible adversarial perturbations, we should attack a more general problem which is finding architectures that are simultaneously robust to perceptible perturbations, geometric transformations (e.g. rotation, scaling), image distortions (lighting, blur), and more (e.g. occlusion, shadow). Only then we may be able to solve the problem of adversarial vulnerability.

preprint 2022 arXiv

Sensitivity of Average Precision to Bounding Box Perturbations

Object detection is a fundamental vision task. It has been highly researched in academia and has been widely adopted in industry. Average Precision (AP) is the standard score for evaluating object detectors. Our understanding of the subtleties of this score, however, is limited. Here, we quantify the sensitivity of AP to bounding box perturbations and show that AP is very sensitive to small translations. A shift of only one pixel is enough to drop the mAP of a model by 8.4%. The mAP drop over small objects with only a one-pixel shift is 23.1%. The corresponding numbers when ground-truth (GT) boxes are used as predictions are 23% and 41.7%, respectively. These results explain why achieving higher mAP becomes increasingly harder as models get better. We also investigate the effect of box scaling on AP. Code and data are available at https://github.com/aliborji/AP_Box_Perturbation.
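To build intuition for why a one-pixel shift hurts small objects more, consider the intersection over union (IoU) that detection scoring typically relies on. The sketch below is illustrative only: the box sizes, the purely horizontal shift, and the helper names `iou` and `shifted` are assumptions, not the paper's evaluation protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def shifted(box, dx, dy):
    """Translate a box by (dx, dy) pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Hypothetical box sizes chosen only to contrast small and large objects.
small = (0, 0, 10, 10)
large = (0, 0, 200, 200)

iou_small = iou(small, shifted(small, 1, 0))  # 90 / 110
iou_large = iou(large, shifted(large, 1, 0))  # 39800 / 40200

print(f"small box IoU after 1px shift: {iou_small:.3f}")
print(f"large box IoU after 1px shift: {iou_large:.3f}")
```

The same one-pixel translation removes a far larger share of the overlap for the small box, so it is more likely to fall below a strict IoU threshold.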

preprint 2022 arXiv

SplitMixer: Fat Trimmed From MLP-like Models

We present SplitMixer, a simple and lightweight isotropic MLP-like architecture, for visual recognition. It contains two types of interleaving convolutional operations to mix information across spatial locations (spatial mixing) and channels (channel mixing). The first one includes sequentially applying two depthwise 1D kernels, instead of a 2D kernel, to mix spatial information. The second one is splitting the channels into overlapping or non-overlapping segments, with or without shared parameters, and applying our proposed channel mixing approaches or 3D convolution to mix channel information. Depending on design choices, a number of SplitMixer variants can be constructed to balance accuracy, the number of parameters, and speed. We show, both theoretically and experimentally, that SplitMixer performs on par with the state-of-the-art MLP-like models while having a significantly lower number of parameters and FLOPS. For example, without strong data augmentation and optimization, SplitMixer achieves around 94% accuracy on CIFAR-10 with only 0.28M parameters, while ConvMixer achieves the same accuracy with about 0.6M parameters. The well-known MLP-Mixer achieves 85.45% with 17.1M parameters. On CIFAR-100 dataset, SplitMixer achieves around 73% accuracy, on par with ConvMixer, but with about 52% fewer parameters and FLOPS. We hope that our results spark further research towards finding more efficient vision architectures and facilitate the development of MLP-like models. Code is available at https://github.com/aliborji/splitmixer.
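The saving from replacing one depthwise 2D kernel with two sequential depthwise 1D kernels can be checked with simple arithmetic. The sketch below counts spatial-mixing weights only and ignores biases; the channel count and kernel size are hypothetical values, not the configurations used in the paper.

```python
def depthwise_2d_params(channels, k):
    """Weights in one depthwise k x k convolution (one k x k filter per channel)."""
    return channels * k * k


def depthwise_two_1d_params(channels, k):
    """Weights in a depthwise 1 x k convolution followed by a depthwise k x 1 one."""
    return 2 * channels * k


channels, k = 256, 5  # illustrative values only
full = depthwise_2d_params(channels, k)
split = depthwise_two_1d_params(channels, k)

print(f"2D depthwise kernel:  {full} weights")
print(f"two 1D kernels:       {split} weights")
print(f"fraction kept:        {split / full:.2f}")
```

The ratio is 2/k, so the saving in spatial-mixing weights grows with kernel size. The channel-mixing savings described in the abstract are a separate effect not modelled here.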

preprint 2021 arXiv

Enhancing sensor resolution improves CNN accuracy given the same number of parameters or FLOPS

High image resolution is critical to obtaining good performance in many computer vision applications. The computational complexity of CNNs, however, grows significantly with input image size. Here, we show that it is almost always possible to modify a network such that it achieves higher accuracy at a higher input resolution while having the same number of parameters and/or FLOPS. The idea is similar to the EfficientNet paper, but instead of optimizing network width, depth and resolution simultaneously, here we focus only on input resolution. This makes the search space much smaller, which is more suitable for low computational budget regimes. More importantly, by controlling for the number of model parameters (and hence model capacity), we show that the additional benefit in accuracy is indeed due to the higher input resolution. Preliminary empirical investigation over MNIST, Fashion MNIST, and CIFAR10 datasets demonstrates the efficiency of the proposed approach.
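One way to see that resolution can be traded against other dimensions at a fixed compute budget is to count multiply-accumulate operations for a single convolution layer. This is a sketch under stated assumptions (stride 1, "same" padding, bias ignored, hypothetical layer sizes); it is not necessarily the modification the paper applies.

```python
def conv_macs(height, width, c_in, c_out, k):
    """Multiply-accumulates for one stride-1, same-padded k x k convolution."""
    return height * width * c_in * c_out * k * k


# Hypothetical baseline layer: 32 x 32 input with 64 -> 64 channels.
base = conv_macs(32, 32, 64, 64, 3)

# Doubling the resolution alone quadruples the cost ...
hi_res_same_width = conv_macs(64, 64, 64, 64, 3)

# ... while halving both channel counts brings it back to the baseline.
hi_res_narrow = conv_macs(64, 64, 32, 32, 3)

print(base, hi_res_same_width, hi_res_narrow)
```

In this toy case the narrower layer matches the baseline compute at twice the resolution but has fewer weights, which is why parameters and FLOPS have to be controlled separately.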

preprint 2020 arXiv

Adversarial examples are useful too!

Deep learning has come a long way and has enjoyed unprecedented success. Despite high accuracy, however, deep models are brittle and are easily fooled by imperceptible adversarial perturbations. In contrast to common inference-time attacks, Backdoor (a.k.a. Trojan) attacks target the training phase of model construction, and are extremely difficult to combat since a) the model behaves normally on a pristine testing set and b) the augmented perturbations can be minute and may affect only a few training samples. Here, I propose a new method to tell whether a model has been subject to a backdoor attack. The idea is to generate adversarial examples, targeted or untargeted, using conventional attacks such as FGSM and then feed them back to the classifier. By computing the statistics (here simply mean maps) of the images in different categories and comparing them with the statistics of a reference model, it is possible to visually locate the perturbed regions and unveil the attack.

preprint 2020 arXiv

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed "DAVE", in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named "AVE". Despite the strong relation between auditory and visual cues for guiding gaze during perception, video saliency models only consider visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones in predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally designed to be simple. Two baseline models are developed on the same architecture, which consists of an encoder-decoder. The encoder projects the input into a feature space, followed by a decoder that infers saliency. We conduct an extensive analysis on different modalities and various aspects of multi-modal dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient visible sound source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space lead to more powerful predictions even in the absence of more sophisticated saliency decoders, and (4) the Audio-Visual model improves over 53.54% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models to approach human performance. The code is available at https://github.com/hrtavakoli/DAVE

preprint 2020 arXiv

Empirical Upper Bound, Error Diagnosis and Invariance Analysis of Modern Object Detectors

Object detection remains one of the most notorious open problems in computer vision. Despite large strides in accuracy in recent years, modern object detectors have started to saturate on popular benchmarks, raising the question of how far we can reach with deep learning tools and tricks. Here, by employing 2 state-of-the-art object detection benchmarks, and analyzing more than 15 models over 4 large-scale datasets, we I) carefully determine the upper bound in AP, which is 91.6% on VOC (test2007), 78.2% on COCO (val2017), and 58.9% on OpenImages V4 (validation), regardless of the IOU threshold. These numbers are much better than the mAP of the best model (47.9% on VOC, and 46.9% on COCO; IOUs=.5:.05:.95), II) characterize the sources of errors in object detectors, in a novel and intuitive way, and find that classification error (confusion with other classes and misses) explains the largest fraction of errors and weighs more than localization and duplicate errors, and III) analyze the invariance properties of models when the surrounding context of an object is removed, when an object is placed in an incongruent background, and when images are blurred or flipped vertically. We find that models generate a lot of boxes on empty regions and that context is more important for detecting small objects than larger ones. Our work taps into the tight relationship between object detection and object recognition and offers insights for building better models. Our code is publicly available at https://github.com/aliborji/Deetctionupper bound.git.

preprint 2020 arXiv

Harnessing adversarial examples with a surprisingly simple defense

I introduce a very simple method to defend against adversarial examples. The basic idea is to raise the slope of the ReLU function at the test time. Experiments over MNIST and CIFAR-10 datasets demonstrate the effectiveness of the proposed defense against a number of strong attacks in both untargeted and targeted settings. While perhaps not as effective as the state of the art adversarial defenses, this approach can provide insights to understand and mitigate adversarial attacks. It can also be used in conjunction with other defenses.
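The core idea, a steeper ReLU at inference than at training, can be written in a few lines. This is a minimal sketch: the function name `scaled_relu` and the test-time slope of 3.0 are hypothetical, since the abstract does not state which slope values are used or how they are chosen.

```python
def scaled_relu(x, slope=1.0):
    """ReLU whose positive part has an adjustable slope; slope=1.0 is the standard ReLU."""
    return slope * x if x > 0 else 0.0


TRAIN_SLOPE = 1.0  # standard ReLU, assumed during training
TEST_SLOPE = 3.0   # hypothetical steeper slope, applied only at test time

pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
train_out = [scaled_relu(v, TRAIN_SLOPE) for v in pre_activations]
test_out = [scaled_relu(v, TEST_SLOPE) for v in pre_activations]

print("training-time activations:", train_out)
print("test-time activations:    ", test_out)
```

Negative inputs are still zeroed; only positive activations are amplified, and no weights are retrained.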

preprint 2020 arXiv

ObjectNet Dataset: Reanalysis and Correction

Recently, Barbu et al. introduced a dataset called ObjectNet which includes objects in daily life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Due to the importance and implications of their results regarding the generalization ability of deep models, we take a second look at their findings. We highlight a major problem with their work, which is applying object recognizers to scenes containing multiple objects rather than to isolated objects. The latter results in around 20-30% performance gain using our code. Compared with the results reported in the ObjectNet paper, we observe that around 10-15% of the performance loss can be recovered, without any test-time data augmentation. In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset. Thus, we believe that ObjectNet remains a challenging dataset for testing the generalization power of models beyond datasets on which they have been trained.

preprint 2016 arXiv

Ego2Top: Matching Viewers in Egocentric and Top-view Videos

Egocentric cameras are becoming increasingly popular and provide us with large amounts of videos, captured from the first person perspective. At the same time, surveillance cameras and drones offer an abundance of visual information, often captured from top-view. Although these two sources of information have been separately studied in the past, they have not been collectively studied and related. Having a set of egocentric cameras and a top-view camera capturing the same area, we propose a framework to identify the egocentric viewers in the top-view video. We utilize two types of features for our assignment procedure. Unary features encode what a viewer (seen from top-view or recording an egocentric video) visually experiences over time. Pairwise features encode the relationship between the visual content of a pair of viewers. Modeling each view (egocentric or top) by a graph, the assignment process is formulated as spectral graph matching. Evaluating our method over a dataset of 50 top-view and 188 egocentric videos taken in different scenarios demonstrates the efficiency of the proposed approach in assigning egocentric viewers to identities present in top-view camera. We also study the effect of different parameters such as the number of egocentric viewers and visual features.

preprint 2016 arXiv

Egocentric Height Estimation

Egocentric, or first-person, vision, which became popular in recent years with the emergence of wearable technology, differs from exocentric (third-person) vision in some distinguishable ways, one of which is that the camera wearer is generally not visible in the video frames. Recent work has been done on action and object recognition in egocentric videos, as well as on biometric extraction from first-person videos. Height estimation can be a useful feature for both soft biometrics and object tracking. Here, we propose a method of estimating the height of an egocentric camera without any calibration or reference points. We used both traditional computer vision approaches and deep learning in order to determine the visual cues that result in the best height estimation. We introduce a framework inspired by two-stream networks comprising two Convolutional Neural Networks, one based on spatial information, and one based on information given by optical flow in a frame. Given an egocentric video as an input to the framework, our model yields a height estimate as an output. We also incorporate late fusion to learn a combination of temporal and spatial cues. Comparing our model with other methods we used as baselines, we achieve height estimates for videos with a Mean Average Error of 14.04 cm over a range of 103 cm of data, and classification accuracy for relative height (tall, medium or short) up to 93.75% where chance level is 33%.

preprint 2016 arXiv

Egocentric Meets Top-view

Thanks to the availability and increasing popularity of egocentric cameras such as GoPro cameras, glasses, etc., we have been provided with a plethora of videos captured from the first-person perspective. Surveillance cameras and Unmanned Aerial Vehicles (also known as drones) also offer a tremendous amount of video, mostly with top-down or oblique viewpoints. Egocentric vision and top-view surveillance videos have been studied extensively in the past in the computer vision community. However, the relationship between the two has yet to be explored thoroughly. In this effort, we attempt to explore this relationship by approaching two questions. First, having a set of egocentric videos and a top-view video, can we verify if the top-view video contains all, or some, of the egocentric viewers present in the egocentric set? And second, can we identify the egocentric viewers in the content of the top-view video? In other words, can we find the cameramen in the surveillance videos? These problems can become more challenging when the videos are not time-synchronous. Thus we formalize the problem in a way which handles and also estimates the unknown relative time-delays between the egocentric videos and the top-view video. We formulate the problem as a spectral graph matching instance, and jointly seek the optimal assignments and relative time-delays of the videos. As a result, we spatiotemporally localize the egocentric observers in the top-view video. We model each view (egocentric or top) using a graph, and compute the assignment and time-delays in an iterative-alternative fashion.

preprint 2016 arXiv

EgoTransfer: Transferring Motion Across Egocentric and Exocentric Domains using Deep Neural Networks

Mirror neurons have been observed in the primary motor cortex of primate species, in particular in humans and monkeys. A mirror neuron fires when a person performs a certain action, and also when he observes the same action being performed by another person. A crucial step towards building fully autonomous intelligent systems with human-like learning abilities is the capability of modeling the mirror neuron. On the one hand, the abundance of egocentric cameras in the past few years has offered the opportunity to study a lot of vision problems from the first-person perspective. A great deal of interesting research has been done during the past few years, trying to explore various computer vision tasks from the perspective of the self. On the other hand, videos recorded by traditional static cameras capture humans performing different actions from an exocentric third-person perspective. In this work, we take the first step towards relating motion information across these two perspectives. We train models that predict motion in an egocentric view by observing it from an exocentric view, and vice versa. This allows models to predict what an egocentric motion would look like from outside. To do so, we train linear and nonlinear models and evaluate their performance in terms of retrieving the egocentric (exocentric) motion features, while having access to an exocentric (egocentric) motion feature. Our experimental results demonstrate that motion information can be successfully transferred across the two views.

preprint 2016 arXiv

Vanishing point attracts gaze in free-viewing and visual search tasks

To investigate whether the vanishing point (VP) plays a significant role in gaze guidance, we ran two experiments. In the first one, we recorded fixations of 10 observers (4 female; mean age 22; SD=0.84) freely viewing 532 images, out of which 319 had VP (shuffled presentation; each image for 4 secs). We found that the average number of fixations at a local region (80x80 pixels) centered at the VP is significantly higher than the average fixations at random locations (t-test; n=319; p=1.8e-35). To address the confounding factor of saliency, we learned a combined model of bottom-up saliency and VP. AUC score of our model (0.85; SD=0.01) is significantly higher than the original saliency model (e.g., 0.8 using AIM model by Bruce & Tsotsos (2009), t-test; p= 3.14e-16) and the VP-only model (0.64, t-test; p= 4.02e-22). In the second experiment, we asked 14 subjects (4 female, mean age 23.07, SD=1.26) to search for a target character (T or L) placed randomly on a 3x3 imaginary grid overlaid on top of an image. Subjects reported their answers by pressing one of two keys. Stimuli consisted of 270 color images (180 with a single VP, 90 without). The target happened with equal probability inside each cell (15 times L, 15 times T). We found that subjects were significantly faster (and more accurate) when target happened inside the cell containing the VP compared to cells without VP (median across 14 subjects 1.34 sec vs. 1.96; Wilcoxon rank-sum test; p = 0.0014). Response time at VP cells were also significantly lower than response time on images without VP (median 2.37; p= 4.77e-05). These findings support the hypothesis that vanishing point, similar to face and text (Cerf et al., 2009) as well as gaze direction (Borji et al., 2014) attracts attention in free-viewing and visual search.

preprint 2016 arXiv

Vanishing point detection with convolutional neural networks

Inspired by the finding that the vanishing point (road tangent) guides a driver's gaze, in our previous work we showed that the vanishing point attracts gaze during free viewing of natural scenes as well as in visual search (Borji et al., Journal of Vision 2016). We have also introduced improved saliency models using vanishing point detectors (Feng et al., WACV 2016). Here, we aim to predict vanishing points in naturalistic environments by training convolutional neural networks in an end-to-end manner over a large set of road images downloaded from YouTube with vanishing points annotated. Results demonstrate the effectiveness of our approach compared to classic approaches to vanishing point detection in the literature.

preprint 2016 arXiv

What can we learn about CNNs from a large scale controlled object dataset?

Tolerance to image variations (e.g. translation, scale, pose, illumination) is an important desired property of any object recognition system, be it human or machine. Moving towards increasingly bigger datasets has been trending in computer vision, especially with the emergence of highly popular deep learning models. While being very useful for learning invariance to object inter- and intra-class shape variability, these large-scale wild datasets are not very useful for learning invariance to other parameters, forcing researchers to resort to other tricks for training a model. In this work, we introduce a large-scale synthetic dataset, which is freely and publicly available, and use it to answer several fundamental questions regarding invariance and selectivity properties of convolutional neural networks. Our dataset contains two parts: a) objects shot on a turntable: 16 categories, 8 rotation angles, 11 cameras on a semicircular arch, 5 lighting conditions, 3 focus levels, variety of backgrounds (23.4 per instance) generating 1320 images per instance (over 20 million images in total), and b) scenes: in which a robot arm takes pictures of objects on a 1:160 scale scene. We study: 1) invariance and selectivity of different CNN layers, 2) knowledge transfer from one object category to another, 3) systematic or random sampling of images to build a train set, 4) domain adaptation from synthetic to natural scenes, and 5) order of knowledge delivery to CNNs. We also explore how our analyses can lead the field to develop more efficient CNNs.

preprint 2015 arXiv

CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research

Saliency modeling has been an active research area in computer vision for about two decades. Existing state-of-the-art models perform very well in predicting where people look in natural scenes. There is, however, the risk that these models may have been overfitting themselves to available small-scale biased datasets, thus trapping the progress in a local minimum. To gain a deeper insight regarding current issues in saliency modeling and to better gauge progress, we recorded eye movements of 120 observers while they freely viewed a large number of naturalistic and artificial images. Our stimuli include 4000 images; 200 from each of 20 categories covering different types of scenes such as Cartoons, Art, Objects, Low resolution images, Indoor, Outdoor, Jumbled, Random, and Line drawings. We analyze some basic properties of this dataset and compare some successful models. We believe that our dataset opens new challenges for the next generation of saliency models and helps conduct behavioral studies on bottom-up visual attention.

preprint 2015 arXiv

Computational models of attention

This chapter reviews recent computational models of visual attention. We begin with models for the bottom-up or stimulus-driven guidance of attention to salient visual items, which we examine in seven different broad categories. We then examine more complex models which address the top-down or goal-oriented guidance of attention towards items that are more relevant to the task at hand.

preprint 2015 arXiv

Computational models: Bottom-up and top-down aspects

Computational models of visual attention have become popular over the past decade, we believe primarily for two reasons: first, models make testable predictions that can be explored by experimentalists as well as theoreticians; second, models have practical and technological applications of interest to the applied science and engineering communities. In this chapter, we take a critical look at recent attention modeling efforts. We focus on computational models of attention as defined by Tsotsos & Rothenstein (2011): models which can process any visual stimulus (typically, an image or video clip), which can possibly also be given some task definition, and which make predictions that can be compared to human or animal behavioral or physiological responses elicited by the same stimulus and task. Thus, we here place less emphasis on abstract models, phenomenological models, purely data-driven fitting or extrapolation models, or models specifically designed for a single task or for a restricted class of stimuli. For theoretical models, we refer the reader to a number of previous reviews that address attention theories and models more generally (Itti & Koch, 2001; Paletta et al., 2005; Frintrop et al., 2010; Rothenstein & Tsotsos, 2008; Gottlieb & Balan, 2010; Toet, 2011; Borji & Itti, 2012).

preprint 2015 arXiv

Fixation prediction with a combined model of bottom-up saliency and vanishing point

By predicting where humans look in natural scenes, we can understand how they perceive complex natural scenes and prioritize information for further high-level visual processing. Several models have been proposed for this purpose, yet there is a gap between the best existing saliency models and human performance. While many researchers have developed purely computational models for fixation prediction, fewer attempts have been made to discover cognitive factors that guide gaze. Here, we study the effect of a particular type of scene structural information, known as the vanishing point, and show that human gaze is attracted to the vanishing point regions. We record eye movements of 10 observers over 532 images, out of which 319 have vanishing points. We then construct a combined model of traditional saliency and a vanishing point channel and show that our model outperforms state-of-the-art saliency models using three scores on our dataset.

preprint 2015 arXiv

Reconciling saliency and object center-bias hypotheses in explaining free-viewing fixations

Predicting where people look in natural scenes has attracted a lot of interest in computer vision and computational neuroscience over the past two decades. Two seemingly contrasting categories of cues have been proposed to influence where people look: low-level image saliency and high-level semantic information. Our first contribution is to take a detailed look at these cues to confirm the hypothesis proposed by Henderson (1993) and Nuthmann & Henderson (2010) that observers tend to look at the center of objects. We analyzed fixation data for scene free-viewing over 17 observers on 60 fully annotated images with various types of objects. Images contained different types of scenes, such as natural scenes, line drawings, and 3D rendered scenes. Our second contribution is to propose a simple combined model of low-level saliency and object center-bias that outperforms each individual component significantly over our data, as well as on the OSIE dataset by Xu et al. (2014). The results reconcile saliency with object center-bias hypotheses and highlight that both types of cues are important in guiding fixations. Our work opens new directions to understand strategies that humans use in observing scenes and objects, and demonstrates the construction of combined models of low-level saliency and high-level object-based information.

preprint2015arXiv

Vanishing Point Attracts Eye Movements in Scene Free-viewing

Eye movements are crucial in understanding complex scenes. By predicting where humans look in natural scenes, we can understand how they perceive scenes and prioritize information for further high-level processing. Here, we study the effect of a particular type of scene structural information, known as the vanishing point, and show that human gaze is attracted to vanishing point regions. We then build a combined model of traditional saliency and a vanishing point channel that outperforms state-of-the-art saliency models.

preprint2014arXiv

What is a salient object? A dataset and a baseline model for salient object detection

Salient object detection (or salient region detection) models, in contrast to fixation prediction models, have traditionally dealt with locating and segmenting the most salient object or region in a scene. While the notion of a most salient object is sensible when multiple objects exist in a scene, current datasets for evaluating saliency detection approaches often contain scenes with only a single object. We make three main contributions in this paper. First, we take an in-depth look at the problem of salient object detection by studying the relationship between where people look in scenes and what they choose as the most salient object when they are explicitly asked. Based on the agreement between fixations and saliency judgments, we then suggest that the most salient object is the one that attracts the highest fraction of fixations. Second, we provide two new, less biased benchmark datasets containing scenes with multiple objects that challenge existing saliency models. Indeed, we observed a severe drop in performance of 8 state-of-the-art models on our datasets (40% to 70%). Third, we propose a very simple yet powerful model based on superpixels to be used as a baseline for model evaluation and comparison. While on par with the best models on the MSRA-5K dataset, our model wins over other models on our data, highlighting a serious drawback of existing models: they conflate the processes of locating the most salient object and segmenting it. We also provide a review and statistical analysis of some labeled scene datasets that can be used for evaluating salient object detection models. We believe that our work can greatly help remedy the over-fitting of models to existing biased datasets and opens new avenues for future research in this fast-evolving field.
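The criterion suggested in the abstract, that the most salient object is the one attracting the highest fraction of fixations, can be illustrated with a small sketch. The data format here (fixations as pixel coordinates, objects as sets of pixels keyed by label) and the function name are assumptions made for the example, not the paper's annotation format or code.

```python
from collections import Counter


def most_salient_object(fixations, object_masks):
    """Return (label, fraction) for the object that receives the most fixations.

    fixations: list of (row, col) fixation points.
    object_masks: dict mapping a label to a set of (row, col) pixels
        (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for point in fixations:
        for label, pixels in object_masks.items():
            if point in pixels:
                counts[label] += 1
    if not fixations or not counts:
        # No fixations at all, or none landed on any annotated object.
        return None, 0.0
    label, hits = counts.most_common(1)[0]
    return label, hits / len(fixations)


# Toy scene with two annotated objects and five fixations.
masks = {
    "dog": {(r, c) for r in range(0, 2) for c in range(0, 2)},
    "ball": {(r, c) for r in range(3, 5) for c in range(3, 5)},
}
fixations = [(0, 0), (1, 1), (0, 1), (3, 4), (2, 2)]
label, fraction = most_salient_object(fixations, masks)
```

Here three of the five fixations fall on the "dog" mask, so it is selected with a fraction of 0.6. Ties are broken by the order in which labels are first counted, which is an arbitrary choice in this sketch.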

27 published item(s)

Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection

A New Kind of Adversarial Example

Complementary datasets to COCO for object detection

How good are deep models in understanding the generated images?

Is current research on adversarial robustness addressing the right problem?

Sensitivity of Average Precision to Bounding Box Perturbations

SplitMixer: Fat Trimmed From MLP-like Models

Enhancing sensor resolution improves CNN accuracy given the same number of parameters or FLOPS

Adversarial examples are useful too!

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

Empirical Upper Bound, Error Diagnosis and Invariance Analysis of Modern Object Detectors

Harnessing adversarial examples with a surprisingly simple defense

ObjectNet Dataset: Reanalysis and Correction

Ego2Top: Matching Viewers in Egocentric and Top-view Videos

Egocentric Height Estimation

Egocentric Meets Top-view

EgoTransfer: Transferring Motion Across Egocentric and Exocentric Domains using Deep Neural Networks

Vanishing point attracts gaze in free-viewing and visual search tasks

Vanishing point detection with convolutional neural networks

What can we learn about CNNs from a large scale controlled object dataset?

CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research

Computational models of attention

Computational models: Bottom-up and top-down aspects

Fixation prediction with a combined model of bottom-up saliency and vanishing point

Reconciling saliency and object center-bias hypotheses in explaining free-viewing fixations

Vanishing Point Attracts Eye Movements in Scene Free-viewing

What is a salient object? A dataset and a baseline model for salient object detection