Topic overview

Computer Vision

20554 works58054 researchers0 institutions

Topic snapshot

What this area looks like now

20554works
58054authors
0experts visible
0communities

Next steps

Move from topic reading into action

The graph preview below keeps the nearby papers, people and communities visible in the same reading flow.

Topic graph

See the topic as a live network

Open full explorer

Inspect nearby papers, researchers, institutions and communities without opening a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Papers in this area

24 featured work(s)

preprint2015arXiv

Separable time-causal and time-recursive spatio-temporal receptive fields

We present an improved model and theory for time-causal and time-recursive spatio-temporal receptive fields, obtained by a combination of Gaussian receptive fields over the spatial domain and first-order integrators or equivalently truncated exponential filters coupled in cascade over the temporal domain. Compared to previous spatio-temporal scale-space formulations in terms of non-enhancement of local extrema or scale invariance, these receptive fields are based on different scale-space axiomatics over time by ensuring non-creation of new local extrema or zero-crossings with increasing temporal scale. Specifically, extensions are presented about parameterizing the intermediate temporal scale levels, analysing the resulting temporal dynamics and transferring the theory to a discrete implementation in terms of recursive filters over time.

preprint2016arXiv

Density Modeling of Images using a Generalized Normalization Transformation

We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.

preprint2018arXiv

A Large-scale Attribute Dataset for Zero-shot Learning

Zero-Shot Learning (ZSL) has attracted huge research attention over the past few years; it aims to learn the new concepts that have never been seen before. In classical ZSL algorithms, attributes are introduced as the intermediate semantic representation to realize the knowledge transfer from seen classes to unseen classes. Previous ZSL algorithms are tested on several benchmark datasets annotated with attributes. However, these datasets are defective in terms of the image distribution and attribute diversity. In addition, we argue that the "co-occurrence bias problem" of existing datasets, which is caused by the biased co-occurrence of objects, significantly hinders models from correctly learning the concept. To overcome these problems, we propose a Large-scale Attribute Dataset (LAD). Our dataset has 78,017 images of 5 super-classes, 230 classes. The image number of LAD is larger than the sum of the four most popular attribute datasets. 359 attributes of visual, semantic and subjective properties are defined and annotated in instance-level. We analyze our dataset by conducting both supervised learning and zero-shot learning tasks. Seven state-of-the-art ZSL algorithms are tested on this new dataset. The experimental results reveal the challenge of implementing zero-shot learning on our dataset.

preprint2017arXiv

AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

Significant progress has been achieved in Computer Vision by leveraging large-scale image datasets. However, large-scale datasets for complex Computer Vision tasks beyond classification are still limited. This paper proposed a large-scale dataset named AIC (AI Challenger) with three sub-datasets, human keypoint detection (HKD), large-scale attribute dataset (LAD) and image Chinese captioning (ICC). In this dataset, we annotate class labels (LAD), keypoint coordinate (HKD), bounding box (HKD and LAD), attribute (LAD) and caption (ICC). These rich annotations bridge the semantic gap between low-level images and high-level concepts. The proposed dataset is an effective benchmark to evaluate and improve different computational methods. In addition, for related tasks, others can also use our dataset as a new resource to pre-train their models.

preprint2017arXiv

Normative theory of visual receptive fields

This article gives an overview of a normative computational theory of visual receptive fields, by which idealized functional models of early spatial, spatio-chromatic and spatio-temporal receptive fields can be derived in an axiomatic way based on structural properties of the environment in combination with assumptions about the internal structure of a vision system to guarantee consistent handling of image representations over multiple spatial and temporal scales. Interestingly, this theory leads to predictions about visual receptive field shapes with qualitatively very good similarity to biological receptive fields measured in the retina, the LGN and the primary visual cortex (V1) of mammals.

preprint2018arXiv

Object Activity Scene Description, Construction and Recognition

Action recognition is a critical task for social robots to meaningfully engage with their environment. 3D human skeleton-based action recognition is an attractive research area in recent years. Although, the existing approaches are good at action recognition, it is a great challenge to recognize a group of actions in an activity scene. To tackle this problem, at first, we partition the scene into several primitive actions (PAs) based upon motion attention mechanism. Then, the primitive actions are described by the trajectory vectors of corresponding joints. After that, motivated by text classification based on word embedding, we employ convolution neural network (CNN) to recognize activity scenes by considering motion of joints as "word" of activity. The experimental results on the scenes of human activity dataset show the efficiency of the proposed approach.

preprint2018arXiv

MSplit LBI: Realizing Feature Selection and Dense Estimation Simultaneously in Few-shot and Zero-shot Learning

It is one typical and general topic of learning a good embedding model to efficiently learn the representation coefficients between two spaces/subspaces. To solve this task, $L_{1}$ regularization is widely used for the pursuit of feature selection and avoiding overfitting, and yet the sparse estimation of features in $L_{1}$ regularization may cause the underfitting of training data. $L_{2}$ regularization is also frequently used, but it is a biased estimator. In this paper, we propose the idea that the features consist of three orthogonal parts, \emph{namely} sparse strong signals, dense weak signals and random noise, in which both strong and weak signals contribute to the fitting of data. To facilitate such novel decomposition, \emph{MSplit} LBI is for the first time proposed to realize feature selection and dense estimation simultaneously. We provide theoretical and simulational verification that our method exceeds $L_{1}$ and $L_{2}$ regularization, and extensive experimental results show that our method achieves state-of-the-art performance in the few-shot and zero-shot learning.

preprint2018arXiv

Computer-Aided Knee Joint Magnetic Resonance Image Segmentation - A Survey

Osteoarthritis (OA) is one of the major health issues among the elderly population. MRI is the most popular technology to observe and evaluate the progress of OA course. However, the extreme labor cost of MRI analysis makes the process inefficient and expensive. Also, due to human error and subjective nature, the inter- and intra-observer variability is rather high. Computer-aided knee MRI segmentation is currently an active research field because it can alleviate doctors and radiologists from the time consuming and tedious job, and improve the diagnosis performance which has immense potential for both clinic and scientific research. In the past decades, researchers have investigated automatic/semi-automatic knee MRI segmentation methods extensively. However, to the best of our knowledge, there is no comprehensive survey paper in this field yet. In this survey paper, we classify the existing methods by their principles and discuss the current research status and point out the future research trend in-depth.

preprint2018arXiv

Ballistocardiogram Signal Processing: A Literature Review

Time-domain algorithms are focused on detecting local maxima or local minima using a moving window, and therefore finding the interval between the dominant J-peaks of ballistocardiogram (BCG) signal. However, this approach has many limitations due to the nonlinear and nonstationary behavior of the BCG signal. This is because the BCG signal does not display consistent J-peaks, which can usually be the case for overnight, in-home monitoring, particularly with frail elderly. Additionally, its accuracy will be undoubtedly affected by motion artifacts. Second, frequency-domain algorithms do not provide information about interbeat intervals. Nevertheless, they can provide information about heart rate variability. This is usually done by taking the fast Fourier transform or the inverse Fourier transform of the logarithm of the estimated spectrum, i.e., cepstrum of the signal using a sliding window. Thereafter, the dominant frequency is obtained in a particular frequency range. The limit of these algorithms is that the peak in the spectrum may get wider and multiple peaks may appear, which might cause a problem in measuring the vital signs. At last, the objective of wavelet-domain algorithms is to decompose the signal into different components, hence the component which shows an agreement with the vital signs can be selected i.e., the selected component contains only information about the heart cycles or respiratory cycles, respectively. An empirical mode decomposition is an alternative approach to wavelet decomposition, and it is also a very suitable approach to cope with nonlinear and nonstationary signals such as cardiorespiratory signals. Apart from the above-mentioned algorithms, machine learning approaches have been implemented for measuring heartbeats. However, manual labeling of training data is a restricting property.

preprint2018arXiv

Learning Image Conditioned Label Space for Multilabel Classification

This work addresses the task of multilabel image classification. Inspired by the great success from deep convolutional neural networks (CNNs) for single-label visual-semantic embedding, we exploit extending these models for multilabel images. Specifically, we propose an image-dependent ranking model, which returns a ranked list of labels according to its relevance to the input image. In contrast to conventional CNN models that learn an image representation (i.e. the image embedding vector), the developed model learns a mapping (i.e. a transformation matrix) from an image in an attempt to differentiate between its relevant and irrelevant labels. Despite the conceptual simplicity of our approach, experimental results on a public benchmark dataset demonstrate that the proposed model achieves state-of-the-art performance while using fewer training images than other multilabel classification methods.

preprint2018arXiv

Accurate pedestrian localization in overhead depth images via Height-Augmented HOG

We tackle the challenge of reliably and automatically localizing pedestrians in real-life conditions through overhead depth imaging at unprecedented high-density conditions. Leveraging upon a combination of Histogram of Oriented Gradients-like feature descriptors, neural networks, data augmentation and custom data annotation strategies, this work contributes a robust and scalable machine learning-based localization algorithm, which delivers near-human localization performance in real-time, even with local pedestrian density of about 3 ped/m2, a case in which most state-of-the art algorithms degrade significantly in performance.

preprint2017arXiv

Zero-Shot Learning posed as a Missing Data Problem

This paper presents a method of zero-shot learning (ZSL) which poses ZSL as the missing data problem, rather than the missing label problem. Specifically, most existing ZSL methods focus on learning mapping functions from the image feature space to the label embedding space. Whereas, the proposed method explores a simple yet effective transductive framework in the reverse way \--- our method estimates data distribution of unseen classes in the image feature space by transferring knowledge from the label embedding space. In experiments, our method outperforms the state-of-the-art on two popular datasets.

preprint2016arXiv

Near-Infrared Coloring via a Contrast-Preserving Mapping Model

Near-infrared gray images captured together with corresponding visible color images have recently proven useful for image restoration and classification. This paper introduces a new coloring method to add colors to near-infrared gray images based on a contrast-preserving mapping model. A naive coloring method directly adds the colors from the visible color image to the near-infrared gray image; however, this method results in an unrealistic image because of the discrepancies in brightness and image structure between the captured near-infrared gray image and the visible color image. To solve the discrepancy problem, first we present a new contrast-preserving mapping model to create a new near-infrared gray image with a similar appearance in the luminance plane to the visible color image, while preserving the contrast and details of the captured near-infrared gray image. Then based on the proposed contrast-preserving mapping model, we develop a method to derive realistic colors that can be added to the newly created near-infrared gray image. Experimental results show that the proposed method can not only preserve the local contrasts and details of the captured near-infrared gray image, but transfers the realistic colors from the visible color image to the newly created near-infrared gray image. Experimental results also show that the proposed approach can be applied to near-infrared denoising.

preprint2018arXiv

Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Especially historical prints require book specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort for manually annotating ground truth (GT) various techniques such as voting and pretraining have shown to be very efficient (Reul et al., 2018a, Reul et al., 2018b). Calamari is a new open source OCR line recognition software that both uses state-of-the art Deep Neural Networks (DNNs) implemented in Tensorflow and giving native support for techniques such as pretraining and voting. The customizable network architectures constructed of Convolutional Neural Networks (CNNS) and Long-ShortTerm-Memory (LSTM) layers are trained by the so-called Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, which considerably outperforms the results of the existing softwares.

preprint2018arXiv

Attention-based Active Visual Search for Mobile Robots

We present an active visual search model for finding objects in unknown environments. The proposed algorithm guides the robot towards the sought object using the relevant stimuli provided by the visual sensors. Existing search strategies are either purely reactive or use simplified sensor models that do not exploit all the visual information available. In this paper, we propose a new model that actively extracts visual information via visual attention techniques and, in conjunction with a non-myopic decision-making algorithm, leads the robot to search more relevant areas of the environment. The attention module couples both top-down and bottom-up attention models enabling the robot to search regions with higher importance first. The proposed algorithm is evaluated on a mobile robot platform in a 3D simulated environment. The results indicate that the use of visual attention significantly improves search, but the degree of improvement depends on the nature of the task and the complexity of the environment. In our experiments, we found that performance enhancements of up to 42\% in structured and 38\% in highly unstructured cluttered environments can be achieved using visual attention mechanisms.

preprint2019arXiv

Supervised Learning of the Next-Best-View for 3D Object Reconstruction

Motivated by the advances in 3D sensing technology and the spreading of low-cost robotic platforms, 3D object reconstruction has become a common task in many areas. Nevertheless, the selection of the optimal sensor pose that maximizes the reconstructed surface is a problem that remains open. It is known in the literature as the next-best-view planning problem. In this paper, we propose a novel next-best-view planning scheme based on supervised deep learning. The scheme contains an algorithm for automatic generation of datasets and an original three-dimensional convolutional neural network (3D-CNN) used to learn the next-best-view. Unlike previous work where the problem is addressed as a search, the trained 3D-CNN directly predicts the sensor pose. We present a comparison of the proposed network against a similar net, and we present several experiments of the reconstruction of unknown objects validating the effectiveness of the proposed scheme.

preprint2019arXiv

Learning Occlusion-Aware View Synthesis for Light Fields

In this work, we present a novel learning-based approach to synthesize new views of a light field image. In particular, given the four corner views of a light field, the presented method estimates any in-between view. We use three sequential convolutional neural networks for feature extraction, scene geometry estimation and view selection. Compared to state-of-the-art approaches, in order to handle occlusions we propose to estimate a different disparity map per view. Jointly with the view selection network, this strategy shows to be the most important to have proper reconstructions near object boundaries. Ablation studies and comparison against the state of the art on Lytro light fields show the superior performance of the proposed method. Furthermore, the method is adapted and tested on light fields with wide baselines acquired with a camera array and, in spite of having to deal with large occluded areas, the proposed approach yields very promising results.

preprint2019arXiv

Pixel-wise Regression: 3D Hand Pose Estimation via Spatial-form Representation and Differentiable Decoder

3D Hand pose estimation from a single depth image is an essential topic in computer vision and human-computer interaction. Although the rising of deep learning method boosts the accuracy a lot, the problem is still hard to solve due to the complex structure of the human hand. Existing methods with deep learning either lose spatial information of hand structure or lack a direct supervision of joint coordinates. In this paper, we propose a novel Pixel-wise Regression method, which use spatial-form representation (SFR) and differentiable decoder (DD) to solve the two problems. To use our method, we build a model, in which we design a particular SFR and its correlative DD which divided the 3D joint coordinates into two parts, plane coordinates and depth coordinates and use two modules named Plane Regression (PR) and Depth Regression (DR) to deal with them respectively. We conduct an ablation experiment to show the method we proposed achieve better results than the former methods. We also make an exploration on how different training strategies influence the learned SFRs and results. The experiment on three public datasets demonstrates that our model is comparable with the existing state-of-the-art models and in one of them our model can reduce mean 3D joint error by 25%.

preprint2019arXiv

OCGAN: One-class Novelty Detection Using GANs with Constrained Latent Representations

We present a novel model called OCGAN for the classical problem of one-class novelty detection, where, given a set of examples from a particular class, the goal is to determine if a query example is from the same class. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder's output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real. Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.

preprint2019arXiv

DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications

Convolutional Neural Networks (CNNs) are a cornerstone of the Deep Learning toolbox and have led to many breakthroughs in Artificial Intelligence. These networks have mostly been developed for regular Euclidean domains such as those supporting images, audio, or video. Because of their success, CNN-based methods are becoming increasingly popular in Cosmology. Cosmological data often comes as spherical maps, which make the use of the traditional CNNs more complicated. The commonly used pixelization scheme for spherical maps is the Hierarchical Equal Area isoLatitude Pixelisation (HEALPix). We present a spherical CNN for analysis of full and partial HEALPix maps, which we call DeepSphere. The spherical CNN is constructed by representing the sphere as a graph. Graphs are versatile data structures that can act as a discrete representation of a continuous manifold. Using the graph-based representation, we define many of the standard CNN operations, such as convolution and pooling. With filters restricted to being radial, our convolutions are equivariant to rotation on the sphere, and DeepSphere can be made invariant or equivariant to rotation. This way, DeepSphere is a special case of a graph CNN, tailored to the HEALPix sampling of the sphere. This approach is computationally more efficient than using spherical harmonics to perform convolutions. We demonstrate the method on a classification problem of weak lensing mass maps from two cosmological models and compare the performance of the CNN with that of two baseline classifiers. The results show that the performance of DeepSphere is always superior or equal to both of these baselines. For high noise levels and for data covering only a smaller fraction of the sphere, DeepSphere achieves typically 10% better classification accuracy than those baselines. Finally, we show how learned filters can be visualized to introspect the neural network.

preprint2018arXiv

Generative Probabilistic Novelty Detection with Adversarial Autoencoders

Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. We assume that training data is available to describe only the inlier distribution. Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used to either compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible because we linearize the parameterized manifold capturing the underlying structure of the inlier distribution, and show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improved the training of the autoencoder network. An extensive set of results show that the approach achieves state-of-the-art results on several benchmark datasets.

preprint2019arXiv

Smooth Adversarial Examples

This paper investigates the visual quality of the adversarial examples. Recent papers propose to smooth the perturbations to get rid of high frequency artefacts. In this work, smoothing has a different meaning as it perceptually shapes the perturbation according to the visual content of the image to be attacked. The perturbation becomes locally smooth on the flat areas of the input image, but it may be noisy on its textured areas and sharp across its edges. This operation relies on Laplacian smoothing, well-known in graph signal processing, which we integrate in the attack pipeline. We benchmark several attacks with and without smoothing under a white-box scenario and evaluate their transferability. Despite the additional constraint of smoothness, our attack has the same probability of success at lower distortion.

preprint2018arXiv

Cross-scale predictive dictionaries

Sparse representations using data dictionaries provide an efficient model particularly for signals that do not enjoy alternate analytic sparsifying transformations. However, solving inverse problems with sparsifying dictionaries can be computationally expensive, especially when the dictionary under consideration has a large number of atoms. In this paper, we incorporate additional structure on to dictionary-based sparse representations for visual signals to enable speedups when solving sparse approximation problems. The specific structure that we endow onto sparse models is that of a multi-scale modeling where the sparse representation at each scale is constrained by the sparse representation at coarser scales. We show that this cross-scale predictive model delivers significant speedups, often in the range of 10-60$\times$, with little loss in accuracy for linear inverse problems associated with images, videos, and light fields.

preprint2019arXiv

Programmable Spectrometry -- Per-pixel Classification of Materials using Learned Spectral Filters

Many materials have distinct spectral profiles. This facilitates estimation of the material composition of a scene at each pixel by first acquiring its hyperspectral image, and subsequently filtering it using a bank of spectral profiles. This process is inherently wasteful since only a set of linear projections of the acquired measurements contribute to the classification task. We propose a novel programmable camera that is capable of producing images of a scene with an arbitrary spectral filter. We use this camera to optically implement the spectral filtering of the scene's hyperspectral image with the bank of spectral profiles needed to perform per-pixel material classification. This provides gains both in terms of acquisition speed --- since only the relevant measurements are acquired --- and in signal-to-noise ratio --- since we invariably avoid narrowband filters that are light inefficient. Given training data, we use a range of classical and modern techniques including SVMs and neural networks to identify the bank of spectral profiles that facilitate material classification. We verify the method in simulations on standard datasets as well as real data using a lab prototype of the camera.

People in this topic

12 visible researcher(s)