Source author record

Anurag Mittal

Anurag Mittal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Information Retrieval Artificial Intelligence Computation and Language Machine Learning

Catalog footprint

What is connected

9works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks

Video-based computer vision tasks can benefit from estimation of the salient regions and interactions between those regions. Traditionally, this has been done by identifying the object regions in the images by utilizing pre-trained models to perform object detection, object segmentation and/or object pose estimation. Although using pre-trained models is a viable approach, it has several limitations in the need for an exhaustive annotation of object categories, a possible domain gap between datasets, and a bias that is typically present in pre-trained models. In this work, we propose to utilize the common rationale that a sequence of video frames capture a set of common objects and interactions between them, thus a notion of co-segmentation between the video frame features may equip the model with the ability to automatically focus on task-specific salient regions and improve the underlying task's performance in an end-to-end manner. In this regard, we propose a generic module called ``Co-Segmentation inspired Attention Module'' (COSAM) that can be plugged in to any CNN model to promote the notion of co-segmentation based attention among a sequence of video frame features. We show the application of COSAM in three video-based tasks namely: 1) Video-based person re-ID, 2) Video captioning, & 3) Video action classification and demonstrate that COSAM is able to capture the task-specific salient regions in video frames, thus leading to notable performance improvements along with interpretable attention maps for a variety of video-based vision tasks, with possible application to other video-based vision tasks as well.

preprint2022arXiv

Non-linear Motion Estimation for Video Frame Interpolation using Space-time Convolutions

Video frame interpolation aims to synthesize one or multiple frames between two consecutive frames in a video. It has a wide range of applications including slow-motion video generation, frame-rate up-scaling and developing video codecs. Some older works tackled this problem by assuming per-pixel linear motion between video frames. However, objects often follow a non-linear motion pattern in the real domain and some recent methods attempt to model per-pixel motion by non-linear models (e.g., quadratic). A quadratic model can also be inaccurate, especially in the case of motion discontinuities over time (i.e. sudden jerks) and occlusions, where some of the flow information may be invalid or inaccurate. In our paper, we propose to approximate the per-pixel motion using a space-time convolution network that is able to adaptively select the motion model to be used. Specifically, we are able to softly switch between a linear and a quadratic model. Towards this end, we use an end-to-end 3D CNN encoder-decoder architecture over bidirectional optical flows and occlusion maps to estimate the non-linear motion model of each pixel. Further, a motion refinement module is employed to refine the non-linear motion and the interpolated frames are estimated by a simple warping of the neighboring frames with the estimated per-pixel motion. Through a set of comprehensive experiments, we validate the effectiveness of our model and show that our method outperforms state-of-the-art algorithms on four datasets (Vimeo, DAVIS, HD and GoPro).

preprint2020arXiv

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Recent studies have shown that current VQA models are heavily biased on the language priors in the train set to answer the question, irrespective of the image. E.g., overwhelmingly answer "what sport is" as "tennis" or "what color banana" as "yellow." This behavior restricts them from real-world application scenarios. In this work, we propose a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA that reduces this effect. VGQE utilizes both visual and language modalities equally while encoding the question. Hence the question representation itself gets sufficient visual-grounding, and thus reduces the dependency of the model on the language priors. We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results on the bias-sensitive split of the VQAv2 dataset; VQA-CPv2. Further, unlike the existing bias-reduction techniques, on the standard VQAv2 benchmark, our approach does not drop the accuracy; instead, it improves the performance.

preprint2020arXiv

Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval

Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach based on the Stacked Adversarial Network (SAN) and the advantage of Siamese Network (SN) for ZS-SBIR. While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to that of an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on TU-Berlin, and Sketchy database in both standard ZSL and generalized ZSL setting. The proposed method yields a significant improvement in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL) for SBIR.

preprint2016arXiv

Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features

We propose a novel approach for First Impressions Recognition in terms of the Big Five personality-traits from short videos. The Big Five personality traits is a model to describe human personality using five broad categories: Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness. We train two bi-modal end-to-end deep neural network architectures using temporally ordered audio and novel stochastic visual features from few frames, without over-fitting. We empirically show that the trained models perform exceptionally well, even after training from a small sub-portions of inputs. Our method is evaluated in ChaLearn LAP 2016 Apparent Personality Analysis (APA) competition using ChaLearn LAP APA2016 dataset and achieved excellent performance.

preprint2015arXiv

Adaptive Locally Affine-Invariant Shape Matching

Matching deformable objects using their shapes is an important problem in computer vision since shape is perhaps the most distinguishable characteristic of an object. The problem is difficult due to many factors such as intra-class variations, local deformations, articulations, viewpoint changes and missed and extraneous contour portions due to errors in shape extraction. While small local deformations has been handled in the literature by allowing some leeway in the matching of individual contour points via methods such as Chamfer distance and Hausdorff distance, handling more severe deformations and articulations has been done by applying local geometric corrections such as similarity or affine. However, determining which portions of the shape should be used for the geometric corrections is very hard, although some methods have been tried. In this paper, we address this problem by an efficient search for the group of contour segments to be clustered together for a geometric correction using Dynamic Programming by essentially searching for the segmentations of two shapes that lead to the best matching between them. At the same time, we allow portions of the contours to remain unmatched to handle missing and extraneous contour portions. Experiments indicate that our method outperforms other algorithms, especially when the shapes to be matched are more complex.

preprint2015arXiv

Sketch-based Image Retrieval from Millions of Images under Rotation, Translation and Scale Variations

Proliferation of touch-based devices has made sketch-based image retrieval practical. While many methods exist for sketch-based object detection/image retrieval on small datasets, relatively less work has been done on large (web)-scale image retrieval. In this paper, we present an efficient approach for image retrieval from millions of images based on user-drawn sketches. Unlike existing methods for this problem which are sensitive to even translation or scale variations, our method handles rotation, translation, scale (i.e. a similarity transformation) and small deformations. The object boundaries are represented as chains of connected segments and the database images are pre-processed to obtain such chains that have a high chance of containing the object. This is accomplished using two approaches in this work: a) extracting long chains in contour segment networks and b) extracting boundaries of segmented object proposals. These chains are then represented by similarity-invariant variable length descriptors. Descriptor similarities are computed by a fast Dynamic Programming-based partial matching algorithm. This matching mechanism is used to generate a hierarchical k-medoids based indexing structure for the extracted chains of all database images in an offline process which is used to efficiently retrieve a small set of possible matched images for query chains. Finally, a geometric verification step is employed to test geometric consistency of multiple chain matches to improve results. Qualitative and quantitative results clearly demonstrate superiority of the approach over existing methods.

preprint2014arXiv

CoMIC: Good features for detection and matching at object boundaries

Feature or interest points typically use information aggregation in 2D patches which does not remain stable at object boundaries when there is object motion against a significantly varying background. Level or iso-intensity curves are much more stable under such conditions, especially the longer ones. In this paper, we identify stable portions on long iso-curves and detect corners on them. Further, the iso-curve associated with a corner is used to discard portions from the background and improve matching. Such CoMIC (Corners on Maximally-stable Iso-intensity Curves) points yield superior results at the object boundary regions compared to state-of-the-art detectors while performing comparably at the interior regions as well. This is illustrated in exhaustive matching experiments for both boundary and non-boundary regions in applications such as stereo and point tracking for structure from motion in video sequences.

preprint2014arXiv

Mixture of Parts Revisited: Expressive Part Interactions for Pose Estimation

Part-based models with restrictive tree-structured interactions for the Human Pose Estimation problem, leaves many part interactions unhandled. Two of the most common and strong manifestations of such unhandled interactions are self-occlusion among the parts and the confusion in the localization of the non-adjacent symmetric parts. By handling the self-occlusion in a data efficient manner, we improve the performance of the basic Mixture of Parts model by a large margin, especially on uncommon poses. Through addressing the confusion in the symmetric limb localization using a combination of two complementing trees, we improve the performance on all the parts by atmost doubling the running time. Finally, we show that the combination of the two solutions improves the results. We report results that are equivalent to the state-of-the-art on two standard datasets. Because of maintaining the tree-structured interactions and only part-level modeling of the base Mixture of Parts model, this is achieved in time that is much less than the best performing part-based model.

Anurag Mittal

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks

Non-linear Motion Estimation for Video Frame Interpolation using Space-time Convolutions

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval

Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features

Adaptive Locally Affine-Invariant Shape Matching

Sketch-based Image Retrieval from Millions of Images under Rotation, Translation and Scale Variations

CoMIC: Good features for detection and matching at object boundaries

Mixture of Parts Revisited: Expressive Part Interactions for Pose Estimation