Source author record

Rama Chellappa

Rama Chellappa appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Computer Vision, Machine Learning, eess.IV, Artificial Intelligence, Cryptography and Security, Databases, Information Theory, math.IT, Robotics

Catalog footprint

What is connected

47 works

9 topics

4 close collaborators

preprint, 2026, arXiv

AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting

Adaptive density control in 3D Gaussian Splatting (3DGS) repeatedly grows the Gaussian population through fixed-cardinality random splitting to discover useful scene structure. However, in vanilla 3DGS, the binary split operator requires many densification rounds to expose fine details, making it a bottleneck for efficient training schedules with fewer iterations. We introduce AdpSplit, an error-driven adaptive split operator that determines the number of split children and initializes the child parameters from L1-pixel-error region statistics, enabling fewer densification iterations and thus reduced training time, while preserving the rendering quality of full-schedule training. Across the MipNeRF360, Deep-Blending, and Tanks&Temples datasets, AdpSplit reduces the training time of multiple accelerated 3DGS pipelines by 9.2%-22.3% as a simple drop-in replacement for the standard split operator. With FastGS, AdpSplit matches the full-schedule PSNR on MipNeRF360 while reducing training time by 16.4%, corresponding to a 12.6x acceleration over vanilla 3DGS.

preprint, 2026, arXiv

AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification

Person re-identification (ReID) systems that match individuals across images or video frames are essential in many real-world applications. However, existing methods are often influenced by attributes such as gender, pose, and body mass index (BMI), which vary in unconstrained settings and raise concerns related to fairness and generalization. To address this, we extend the notion of expressivity, defined as the mutual information between learned features and specific attributes, using a secondary neural network to quantify how strongly attributes are encoded. Applying this framework to three transformer-based ReID models on a large-scale visible-spectrum dataset, we find that BMI consistently shows the highest expressivity in deeper layers. Attributes in the final representation are ranked as BMI > Pitch > Gender > Yaw, and expressivity evolves across layers and training epochs, with pose peaking in intermediate layers and BMI strengthening with depth. We further extend the analysis to cross-spectral person identification across infrared modalities including short-wave, medium-wave, and long-wave infrared. In this setting, pitch becomes comparable to BMI and attribute trends increase monotonically across depth, suggesting increased reliance on structural cues when bridging modality gaps. Overall, the results show that transformer-based ReID embeddings encode a hierarchy of implicit attributes, with morphometric information persistently embedded and pose contributing more strongly under cross-spectral conditions.
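To make the quantity concrete, the sketch below computes the mutual information between a discretized feature and an attribute label with a simple count-based (plug-in) estimator. It only illustrates the definition: the abstract states that expressivity is estimated with a secondary neural network, which is not reproduced here, and the function name `mutual_information` and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def mutual_information(features, attributes):
    """Plug-in estimate of I(F; A) in bits from paired discrete samples."""
    if len(features) != len(attributes):
        raise ValueError("features and attributes must have the same length")
    if not features:
        raise ValueError("at least one sample is required")

    n = len(features)
    joint_counts = Counter(zip(features, attributes))
    feature_counts = Counter(features)
    attribute_counts = Counter(attributes)

    mi = 0.0
    for (f, a), count in joint_counts.items():
        p_joint = count / n
        p_f = feature_counts[f] / n
        p_a = attribute_counts[a] / n
        # Each observed (f, a) pair contributes p(f, a) * log2(p(f, a) / (p(f) p(a))).
        mi += p_joint * math.log2(p_joint / (p_f * p_a))
    return mi


# Toy data (assumed): a feature that fully determines a balanced binary attribute,
# and a feature that is independent of it.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

With the toy inputs, the first call describes a feature that carries all of the attribute information and the second a feature that carries none, which is the contrast an expressivity score is meant to capture.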

preprint, 2026, arXiv

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, in which tasks are co-available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

preprint, 2026, arXiv

SciFig: Towards Automating Scientific Figure Generation

Creating high-quality figures and visualizations for scientific papers is a time-consuming task that requires both deep domain knowledge and professional design skills. Despite over 2.5 million scientific papers published annually, the figure generation process remains largely manual. We introduce SciFig, an end-to-end AI agent system that generates publication-ready pipeline figures directly from research paper texts. SciFig uses a hierarchical layout generation strategy, which parses research descriptions to identify component relationships, groups related elements into functional modules, and generates inter-module connections to establish visual organization. Furthermore, an iterative chain-of-thought (CoT) feedback mechanism progressively improves layouts through multiple rounds of visual analysis and reasoning. We introduce a rubric-based evaluation framework that analyzes 2,219 real scientific figures to extract evaluation rubrics and automatically generates comprehensive evaluation criteria. SciFig achieves 70.1% overall quality on dataset-level evaluation and 66.2% on paper-specific evaluation, with consistently high scores across metrics such as visual clarity, structural organization, and scientific accuracy. The SciFig figure generation pipeline and our evaluation benchmark will be open-sourced.

preprint, 2022, arXiv

Distill and De-bias: Mitigating Bias in Face Verification using Knowledge Distillation

Face recognition networks generally demonstrate bias with respect to sensitive attributes such as gender and skintone. For gender and skintone, we observe that the regions of the face that a network attends to vary by the category of an attribute. This might contribute to bias. Building on this intuition, we propose a novel distillation-based approach called Distill and De-bias (D&D) to force a network to attend to similar face regions, irrespective of the attribute category. In D&D, we train a teacher network on images from one category of an attribute; e.g., light skintone. Then, distilling information from the teacher, we train a student network on images of the remaining category; e.g., dark skintone. A feature-level distillation loss constrains the student network to generate teacher-like representations. This allows the student network to attend to similar face regions for all attribute categories and enables it to reduce bias. We also propose a second distillation step on top of D&D, called D&D++. Here, we distill the 'un-biasedness' of the D&D network into a new student network, the D&D++ network, while training this new network on all attribute categories; e.g., both light and dark skintones. This helps us train a network that is less biased for an attribute, while obtaining higher face verification performance than D&D. We show that D&D++ outperforms existing baselines in reducing gender and skintone bias on the IJB-C dataset, while obtaining higher face verification performance than existing adversarial de-biasing methods. We evaluate the effectiveness of our proposed methods on two state-of-the-art face recognition networks: ArcFace and Crystalface.
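A feature-level distillation loss of the kind mentioned in this abstract can be sketched as a distance between the student's and the teacher's embeddings of the same image. The abstract does not state which distance D&D uses, so the cosine form below is an assumption chosen for illustration, and `feature_distillation_loss` is a hypothetical name.

```python
import math


def feature_distillation_loss(student_feat, teacher_feat):
    """Return 1 - cosine similarity between two feature vectors.

    The loss is 0 when the student feature points in the same direction as
    the teacher feature and grows as the two directions diverge.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")

    dot = sum(s * t for s, t in zip(student_feat, teacher_feat))
    student_norm = math.sqrt(sum(s * s for s in student_feat))
    teacher_norm = math.sqrt(sum(t * t for t in teacher_feat))
    if student_norm == 0.0 or teacher_norm == 0.0:
        raise ValueError("feature vectors must be non-zero")

    return 1.0 - dot / (student_norm * teacher_norm)


# Assumed toy embeddings: the first pair is aligned, the second is orthogonal.
aligned = feature_distillation_loss([3.0, 4.0], [6.0, 8.0])
orthogonal = feature_distillation_loss([1.0, 0.0], [0.0, 1.0])
print(aligned, orthogonal)
```

In a training loop this term would typically be added to the usual recognition loss, with the teacher's weights kept frozen; that arrangement is also an assumption rather than something the abstract specifies.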

preprint, 2022, arXiv

HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet

Semantic segmentation of 3D medical images is a challenging task due to the high variability of the shape and pattern of objects (such as organs or tumors). Given the recent success of deep learning in medical image segmentation, Neural Architecture Search (NAS) has been introduced to find high-performance 3D segmentation network architectures. However, because of the massive computational requirements of 3D data and the discrete optimization nature of architecture search, previous NAS methods require either a long search time or a continuous relaxation, and commonly lead to sub-optimal network architectures. While one-shot NAS can potentially address these disadvantages, its application in the segmentation domain has not been well studied in the expansive multi-scale multi-path search space. To enable one-shot NAS for medical image segmentation, our method, named HyperSegNAS, introduces a HyperNet to assist super-net training by incorporating architecture topology information. Such a HyperNet can be removed once the super-net is trained and introduces no overhead during architecture search. We show that HyperSegNAS yields better performing and more intuitive architectures compared to the previous state-of-the-art (SOTA) segmentation networks; furthermore, it can quickly and accurately find good architecture candidates under different computing constraints. Our method is evaluated on public datasets from the Medical Segmentation Decathlon (MSD) challenge, and achieves SOTA performance.

preprint, 2022, arXiv

PDRF: Progressively Deblurring Radiance Field for Fast and Robust Scene Reconstruction from Blurry Images

We present Progressively Deblurring Radiance Field (PDRF), a novel approach to efficiently reconstruct high quality radiance fields from blurry images. While current State-of-The-Art (SoTA) scene reconstruction methods achieve photo-realistic rendering results from clean source views, their performance suffers when the source views are affected by blur, which is commonly observed for images in the wild. Previous deblurring methods either do not account for 3D geometry, or are computationally intensive. To address these issues, PDRF, a progressively deblurring scheme in radiance field modeling, accurately models blur by incorporating 3D scene context. PDRF further uses an efficient importance sampling scheme, which results in fast scene optimization. Specifically, PDRF proposes a Coarse Ray Renderer to quickly estimate voxel density and feature; a Fine Voxel Renderer is then used to achieve high quality ray tracing. We perform extensive experiments and show that PDRF is 15X faster than previous SoTA while achieving better performance on both synthetic and real scenes.

preprint, 2022, arXiv

REGAS: REspiratory-GAted Synthesis of Views for Multi-Phase CBCT Reconstruction from a single 3D CBCT Acquisition

It is a long-standing challenge to reconstruct Cone Beam Computed Tomography (CBCT) of the lung under respiratory motion. This work takes a step further to address a challenging setting in reconstructing a multi-phase 4D lung image from just a single 3D CBCT acquisition. To this end, we introduce REspiratory-GAted Synthesis of views, or REGAS. REGAS proposes a self-supervised method to synthesize the undersampled tomographic views and mitigate aliasing artifacts in reconstructed images. This method allows a much better estimation of between-phase Deformation Vector Fields (DVFs), which are used to enhance reconstruction quality from direct observations without synthesis. To address the large memory cost of deep neural networks on high resolution 4D data, REGAS introduces a novel Ray Path Transformation (RPT) that allows for distributed, differentiable forward projections. REGAS requires no additional measurements like prior scans, air-flow volume, or breathing velocity. Our extensive experiments show that REGAS significantly outperforms comparable methods in quantitative metrics and visual quality.

preprint, 2022, arXiv

Scalable and Real-time Multi-Camera Vehicle Detection, Re-Identification, and Tracking

Multi-camera vehicle tracking is one of the most complicated tasks in Computer Vision as it involves distinct tasks including Vehicle Detection, Tracking, and Re-identification. Despite the challenges, multi-camera vehicle tracking has immense potential in transportation applications including speed, volume, origin-destination (O-D), and routing data generation. Several recent works have addressed the multi-camera tracking problem. However, most of the effort has gone towards improving accuracy on high-quality benchmark datasets while disregarding lower camera resolutions, compression artifacts and the overwhelming amount of computational power and time needed to carry out this task on the edge, which makes it prohibitive for large-scale and real-time deployment. Therefore, in this work we shed light on practical issues that should be addressed for the design of a multi-camera tracking system to provide actionable and timely insights. Moreover, we propose a real-time city-scale multi-camera vehicle tracking system that compares favorably to computationally intensive alternatives and handles real-world, low-resolution CCTV instead of idealized and curated video streams. To show its effectiveness, in addition to integration into the Regional Integrated Transportation Information System (RITIS), we participated in the 2021 NVIDIA AI City multi-camera tracking challenge and our method is ranked among the top five performers on the public leaderboard.

preprint, 2022, arXiv

Scalable Vehicle Re-Identification via Self-Supervision

As Computer Vision technologies become more mature for intelligent transportation applications, it is time to ask how efficient and scalable they are for large-scale and real-time deployment. Among these technologies is Vehicle Re-Identification, which is one of the key elements in city-scale vehicle analytics systems. Many state-of-the-art solutions for vehicle re-id focus on improving the accuracy on existing re-id benchmarks and often ignore computational complexity. To balance the demands of accuracy and computational efficiency, in this work we propose a simple yet effective hybrid solution empowered by self-supervised training which only uses a single network during inference time and is free of intricate and computation-demanding add-on modules often seen in state-of-the-art approaches. Through extensive experiments, we show our approach, termed Self-Supervised and Boosted VEhicle Re-Identification (SSBVER), is on par with state-of-the-art alternatives in terms of accuracy without introducing any additional overhead during deployment. Additionally, we show that our approach generalizes to different backbone architectures, which accommodates various resource constraints and consistently results in a significant accuracy boost.

preprint, 2022, arXiv

Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection

Object detection plays a key role in many security-critical systems. Adversarial patch attacks, which are easy to implement in the physical world, pose a serious threat to state-of-the-art object detectors. Developing reliable defenses for object detectors against patch attacks is critical but severely understudied. In this paper, we propose Segment and Complete defense (SAC), a general framework for defending object detectors against patch attacks through detection and removal of adversarial patches. We first train a patch segmenter that outputs patch masks which provide pixel-level localization of adversarial patches. We then propose a self adversarial training algorithm to robustify the patch segmenter. In addition, we design a robust shape completion algorithm, which is guaranteed to remove the entire patch from the images if the outputs of the patch segmenter are within a certain Hamming distance of the ground-truth patch masks. Our experiments on COCO and xView datasets demonstrate that SAC achieves superior robustness even under strong adaptive attacks with no reduction in performance on clean images, and generalizes well to unseen patch shapes, attack budgets, and unseen attack methods. Furthermore, we present the APRICOT-Mask dataset, which augments the APRICOT dataset with pixel-level annotations of adversarial patches. We show SAC can significantly reduce the targeted attack success rate of physical patch attacks. Our code is available at https://github.com/joellliu/SegmentAndComplete.
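The removal guarantee in this abstract is stated in terms of the Hamming distance between the predicted and ground-truth patch masks, that is, the number of pixels on which the two binary masks disagree. A minimal sketch of that measurement follows; the function name `hamming_distance` and the example masks are hypothetical, and the shape completion algorithm itself is not reproduced.

```python
def hamming_distance(mask_a, mask_b):
    """Count the pixels where two same-sized binary masks disagree."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of rows")
    for row_a, row_b in zip(mask_a, mask_b):
        if len(row_a) != len(row_b):
            raise ValueError("mask rows must have the same length")

    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(mask_a, mask_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


# Assumed example: a 2x2 square patch and a prediction that misses one
# patch pixel and adds one false-positive pixel.
ground_truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted = [
    [0, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(hamming_distance(predicted, ground_truth))
```

Both missed patch pixels and false positives count toward the distance, so the bound constrains the segmenter's total pixel-level error rather than only its recall.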

preprint, 2022, arXiv

The 6th AI City Challenge

The 6th edition of the AI City Challenge specifically focuses on problems in two domains where there is tremendous unlocked potential at the intersection of computer vision and artificial intelligence: Intelligent Traffic Systems (ITS), and brick and mortar retail businesses. The four challenge tracks of the 2022 AI City Challenge received participation requests from 254 teams across 27 countries. Track 1 addressed city-scale multi-target multi-camera (MTMC) vehicle tracking. Track 2 addressed natural-language-based vehicle track retrieval. Track 3 was a brand new track for naturalistic driving analysis, where the data were captured by several cameras mounted inside the vehicle focusing on driver safety, and the task was to classify driver actions. Track 4 was another new track aiming to achieve retail store automated checkout using only a single view camera. We released two leader boards for submissions based on different methods, including a public leader board for the contest, where no use of external data is allowed, and a general leader board for all submitted results. The top performance of participating teams established strong baselines and even outperformed the state-of-the-art in the proposed challenge tracks.

preprint, 2022, arXiv

Towards performant and reliable undersampled MR reconstruction via diffusion model sampling

Magnetic Resonance (MR) image reconstruction from under-sampled acquisition promises faster scanning time. To this end, current State-of-The-Art (SoTA) approaches leverage deep neural networks and supervised training to learn a recovery model. While these approaches achieve impressive performances, the learned model can be fragile on unseen degradation, e.g. when given a different acceleration factor. These methods are also generally deterministic and provide a single solution to an ill-posed problem; as such, it can be difficult for practitioners to understand the reliability of the reconstruction. We introduce DiffuseRecon, a novel diffusion model-based MR reconstruction method. DiffuseRecon guides the generation process based on the observed signals and a pre-trained diffusion model, and does not require additional training on specific acceleration factors. DiffuseRecon is stochastic in nature and generates results from a distribution of fully-sampled MR images; as such, it allows us to explicitly visualize different potential reconstruction solutions. Lastly, DiffuseRecon proposes an accelerated, coarse-to-fine Monte-Carlo sampling scheme to approximate the most likely reconstruction candidate. The proposed DiffuseRecon achieves SoTA performances reconstructing from raw acquisition signals in fastMRI and SKM-TEA. Code will be open-sourced at www.github.com/cpeng93/DiffuseRecon.
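One common way to guide a generated image with observed signals in undersampled MR reconstruction is a hard data-consistency step: transform the current estimate to k-space, overwrite the sampled locations with the measured values, and transform back. The sketch below shows that step for a single-coil Cartesian setting using NumPy. Whether DiffuseRecon uses exactly this rule is not stated in the abstract, so treat it as an assumed illustration; `enforce_data_consistency` is a hypothetical name.

```python
import numpy as np


def enforce_data_consistency(image, observed_kspace, mask):
    """Replace the sampled k-space entries of `image` with measured values.

    image:           2D array, current image estimate
    observed_kspace: 2D complex array, measured k-space (ignored where mask is False)
    mask:            2D boolean array, True where k-space was actually sampled
    """
    kspace = np.fft.fft2(image)
    # Keep the estimate's own frequencies only where nothing was measured.
    kspace = np.where(mask, observed_kspace, kspace)
    return np.fft.ifft2(kspace)


# Assumed toy setup: sample every other k-space row of a random "ground truth".
rng = np.random.default_rng(0)
truth = rng.random((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[::2, :] = True
observed = np.fft.fft2(truth) * mask

estimate = rng.random((8, 8))
corrected = enforce_data_consistency(estimate, observed, mask)
print(np.allclose(np.fft.fft2(corrected)[mask], observed[mask]))
```

After the step, the corrected image agrees with the measurements at every sampled frequency, while the unsampled frequencies still come from the generative estimate.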

preprint, 2022, arXiv

XraySyn: Realistic View Synthesis From a Single Radiograph Through CT Priors

A radiograph visualizes the internal anatomy of a patient through the use of X-ray, which projects 3D information onto a 2D plane. Hence, radiograph analysis naturally requires physicians to relate the prior about 3D human anatomy to 2D radiographs. Synthesizing novel radiographic views in a small range can assist physicians in interpreting anatomy more reliably; however, radiograph view synthesis is heavily ill-posed, lacking in paired data, and lacking in differentiable operations to leverage learning-based approaches. To address these problems, we use Computed Tomography (CT) for radiograph simulation and design a differentiable projection algorithm, which enables us to achieve geometrically consistent transformations between the radiography and CT domains. Our method, XraySyn, can synthesize novel views on real radiographs through a combination of realistic simulation and finetuning on real radiographs. To the best of our knowledge, this is the first work on radiograph view synthesis. We show that by gaining an understanding of radiography in 3D space, our method can be applied to radiograph bone extraction and suppression without groundtruth bone labels.

preprint, 2020, arXiv

3DRegNet: A Deep Neural Network for 3D Point Registration

We present 3DRegNet, a novel deep learning architecture for the registration of 3D scans. Given a set of 3D point correspondences, we build a deep neural network to address the following two challenges: (i) classification of the point correspondences into inliers/outliers, and (ii) regression of the motion parameters that align the scans into a common reference frame. With regard to regression, we present two alternative approaches: (i) a Deep Neural Network (DNN) registration and (ii) a Procrustes approach using SVD to estimate the transformation. Our correspondence-based approach achieves a higher speedup compared to competing baselines. We further propose the use of a refinement network, which consists of a smaller 3DRegNet as a refinement to improve the accuracy of the registration. Extensive experiments on two challenging datasets demonstrate that we outperform other methods and achieve state-of-the-art results. The code is available.
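The Procrustes alternative mentioned in this abstract has a well-known closed form: given point correspondences, the least-squares rotation comes from the SVD of the cross-covariance of the centered point sets (often called the Kabsch algorithm). The NumPy sketch below shows the unweighted version under the assumption that all correspondences are inliers; how 3DRegNet combines it with its inlier classification is not described here, and `procrustes_rigid` is a hypothetical name.

```python
import numpy as np


def procrustes_rigid(src, dst):
    """Least-squares rigid transform (R, t) with dst[i] ~= R @ src[i] + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 3:
        raise ValueError("src and dst must both have shape (N, 3)")

    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


# Assumed toy example: rotate by 30 degrees about z, then translate.
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([1.0, 2.0, 3.0])
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
moved = points @ R_true.T + t_true

R_est, t_est = procrustes_rigid(points, moved)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

The determinant check matters in practice: without it, noisy or near-planar correspondences can yield a reflection instead of a rotation.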

preprint, 2020, arXiv

ATFaceGAN: Single Face Image Restoration and Recognition from Atmospheric Turbulence

Image degradation due to atmospheric turbulence is common while capturing images at long ranges. To mitigate the degradation due to turbulence, which includes deformation and blur, we propose a generative single frame restoration algorithm which disentangles the blur and deformation due to turbulence and reconstructs a restored image. The disentanglement is achieved by decomposing the distortion due to turbulence into blur and deformation components using a deblur generator and a deformation correction generator, respectively. Two paths of restoration are implemented to regularize the disentanglement and generate two restored images from one degraded image. A fusion function combines the features of the restored images to reconstruct a sharp image with rich details. Adversarial and perceptual losses are added to reconstruct a sharp image and suppress the artifacts, respectively. Extensive experiments demonstrate the effectiveness of the proposed restoration algorithm, which achieves satisfactory performance in face restoration and face recognition.

preprint, 2020, arXiv

Detecting Human-Object Interactions via Functional Generalization

We present an approach for detecting human-object interactions (HOIs) in images, based on the idea that humans interact with functionally similar objects in a similar manner. The proposed model is simple and efficiently uses the data, visual features of the human, relative spatial orientation of the human and the object, and the knowledge that functionally similar objects take part in similar interactions with humans. We provide extensive experimental validation for our approach and demonstrate state-of-the-art results for HOI detection. On the HICO-Det dataset, our method achieves a gain of over 2.5 absolute percentage points in mean average precision (mAP) over the state of the art. We also show that our approach leads to significant performance gains for zero-shot HOI detection in the seen object setting. We further demonstrate that using a generic object detector, our model can generalize to interactions involving previously unseen objects.

preprint, 2020, arXiv

Dual Manifold Adversarial Robustness: Defense against Lp and non-Lp Adversarial Attacks

Adversarial training is a popular defense strategy against attack threat models with bounded Lp norms. However, it often degrades the model performance on normal images and the defense does not generalize well to novel attacks. Given the success of deep generative models such as GANs and VAEs in characterizing the underlying manifold of images, we investigate whether or not the aforementioned problems can be remedied by exploiting the underlying manifold information. To this end, we construct an "On-Manifold ImageNet" (OM-ImageNet) dataset by projecting the ImageNet samples onto the manifold learned by StyleGAN. For this dataset, the underlying manifold information is exact. Using OM-ImageNet, we first show that adversarial training in the latent space of images improves both standard accuracy and robustness to on-manifold attacks. However, since no out-of-manifold perturbations are realized, the defense can be broken by Lp adversarial attacks. We further propose Dual Manifold Adversarial Training (DMAT) where adversarial perturbations in both latent and image spaces are used in robustifying the model. Our DMAT improves performance on normal images, and achieves comparable robustness to the standard adversarial training against Lp attacks. In addition, we observe that models defended by DMAT achieve improved robustness against novel attacks which manipulate images by global color shifts or various types of image filtering. Interestingly, similar improvements are also achieved when the defended models are tested on out-of-manifold natural images. These results demonstrate the potential benefits of using manifold information in enhancing robustness of deep learning models against various types of novel adversarial attacks.

preprint, 2020, arXiv

Occlusion-Adaptive Deep Network for Robust Facial Expression Recognition

Recognizing the expressions of partially occluded faces is a challenging computer vision problem. Previous expression recognition methods either overlooked this issue or resolved it using extreme assumptions. Motivated by the fact that the human visual system is adept at ignoring the occlusion and focusing on non-occluded facial areas, we propose a landmark-guided attention branch to find and discard corrupted features from occluded regions so that they are not used for recognition. An attention map is first generated to indicate if a specific facial part is occluded and guide our model to attend to non-occluded regions. To further improve robustness, we propose a facial region branch to partition the feature maps into non-overlapping facial blocks and task each block to predict the expression independently. This results in more diverse and discriminative features, enabling the expression recognition system to recover even though the face is partially occluded. Owing to the synergistic effects of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets.

preprint, 2020, arXiv

Recognizing Families In the Wild: White Paper for the 4th Edition Data Challenge

Recognizing Families In the Wild (RFIW) is an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much larger than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts of promising future directions.

preprint, 2020, arXiv

SAINT: Spatially Aware Interpolation NeTwork for Medical Slice Synthesis

Deep learning-based single image super-resolution (SISR) methods face various challenges when applied to 3D medical volumetric data (i.e., CT and MR images) due to the high memory cost and anisotropic resolution, which adversely affect their performance. Furthermore, mainstream SISR methods are designed to work over specific upsampling factors, which makes them ineffective in clinical practice. In this paper, we introduce a Spatially Aware Interpolation NeTwork (SAINT) for medical slice synthesis to alleviate the memory constraint that volumetric data poses. Compared to other super-resolution methods, SAINT utilizes voxel spacing information to provide desirable levels of details, and allows for the upsampling factor to be determined on the fly. Our evaluations based on 853 CT scans from four datasets that contain liver, colon, hepatic vessels, and kidneys show that SAINT consistently outperforms other SISR methods in terms of medical slice synthesis quality, while using only a single model to deal with different upsampling factors.

preprint, 2020, arXiv

Spatial Priming for Detecting Human-Object Interactions

The relative spatial layout of a human and an object is an important cue for determining how they interact. However, until now, spatial layout has been used just as side-information for detecting human-object interactions (HOIs). In this paper, we present a method for exploiting this spatial layout information for detecting HOIs in images. The proposed method consists of a layout module which primes a visual module to predict the type of interaction between a human and an object. The visual and layout modules share information through lateral connections at several stages. The model uses predictions from the layout module as a prior to the visual module and the prediction from the visual module is given as the final output. It also incorporates semantic information about the object using word2vec vectors. The proposed model reaches an mAP of 24.79% on the HICO-Det dataset, which is about 2.8 absolute percentage points higher than the current state of the art.

preprint, 2020, arXiv

The 4th AI City Challenge

The AI City Challenge was created to accelerate intelligent video analysis that helps make cities smarter and safer. Transportation is one of the largest segments that can benefit from actionable insights derived from data captured by sensors, where computer vision and deep learning have shown promise in achieving large-scale practical deployment. The 4th annual edition of the AI City Challenge has attracted 315 participating teams across 37 countries, who leveraged city-scale real traffic data and high-quality synthetic data to compete in four challenge tracks. Track 1 addressed video-based automatic vehicle counting, where the evaluation is conducted on both algorithmic effectiveness and computational efficiency. Track 2 addressed city-scale vehicle re-identification with augmented synthetic data to substantially increase the training set for the task. Track 3 addressed city-scale multi-target multi-camera vehicle tracking. Track 4 addressed traffic anomaly detection. The evaluation system shows two leader boards, in which a general leader board shows all submitted results, and a public leader board shows results limited to our contest participation rules, under which teams are not allowed to use external data in their work. The public leader board shows results closer to real-world situations where annotated data are limited. Our results show promise that AI technology can enable smarter and safer transportation systems.

preprint2020arXiv

The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification

In recent years, the research community has approached the problem of vehicle re-identification (re-id) with attention-based models, specifically focusing on regions of a vehicle containing discriminative information. These re-id methods rely on expensive key-point labels, part annotations, and additional attributes including vehicle make, model, and color. Given the large number of vehicle re-id datasets with various levels of annotations, strongly-supervised methods are unable to scale across different domains. In this paper, we present Self-supervised Attention for Vehicle Re-identification (SAVER), a novel approach to effectively learn vehicle-specific discriminative features. Through extensive experimentation, we show that SAVER improves upon the state-of-the-art on challenging VeRi, VehicleID, Vehicle-1M and VERI-Wild datasets.

preprint2020arXiv

Towards Gender-Neutral Face Descriptors for Mitigating Bias in Face Recognition

State-of-the-art deep networks implicitly encode gender information while being trained for face recognition. Gender is often viewed as an important attribute with respect to identifying faces. However, the implicit encoding of gender information in face descriptors has two major issues: (a) it makes the descriptors susceptible to privacy leakage, i.e. a malicious agent can be trained to predict the face gender from such descriptors; (b) it appears to contribute to gender bias in face recognition, i.e. we find a significant difference in the recognition accuracy of DCNNs on male and female faces. Therefore, we present a novel Adversarial Gender De-biasing algorithm (AGENDA) to reduce the gender information present in face descriptors obtained from previously trained face recognition networks. We show that AGENDA significantly reduces gender predictability of face descriptors. Consequently, we are also able to reduce gender bias in face verification while maintaining reasonable recognition performance.

preprint2020arXiv

Visual Question Answering on Image Sets

We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set. To enable research in this new topic, we introduce two ISVQA datasets - indoor and outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, types of questions, biases in dataset, and question-image dependencies. We also build new baseline models to investigate new research challenges in ISVQA.

preprint2016arXiv

An All-In-One Convolutional Neural Network for Face Analysis

We present a multi-purpose algorithm for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition using a single deep convolutional neural network (CNN). The proposed method employs a multi-task learning framework that regularizes the shared parameters of CNN and builds a synergy among different domains and tasks. Extensive experiments show that the network has a better understanding of face and achieves state-of-the-art result for most of these tasks.

preprint2016arXiv

Attributes for Improved Attributes: A Multi-Task Network for Attribute Classification

Attributes, or semantic features, have gained popularity in the past few years in domains ranging from activity recognition in video to face verification. Improving the accuracy of attribute classifiers is an important first step in any application which uses these attributes. In most works to date, attributes have been considered to be independent. However, we know this not to be the case. Many attributes are very strongly related, such as heavy makeup and wearing lipstick. We propose to take advantage of attribute relationships in three ways: by using a multi-task deep convolutional neural network (MCNN) sharing the lowest layers amongst all attributes, sharing the higher layers for related attributes, and by building an auxiliary network on top of the MCNN which utilizes the scores from all attributes to improve the final classification of each attribute. We demonstrate the effectiveness of our method by producing results on two challenging publicly available datasets.

preprint2016arXiv

Convolutional Neural Networks for Attribute-based Active Authentication on Mobile Devices

We present a Deep Convolutional Neural Network (DCNN) architecture for the task of continuous authentication on mobile devices. To deal with the limited resources of these devices, we reduce the complexity of the networks by learning intermediate features such as gender and hair color instead of identities. We present a multi-task, part-based DCNN architecture for attribute detection that performs better than the state-of-the-art methods in terms of accuracy. As a byproduct of the proposed architecture, we are able to explore the embedding space of the attributes extracted from different facial parts, such as mouth and eyes, to discover new attributes. Furthermore, through extensive experimentation, we show that the attribute features extracted by our method outperform the previously presented attribute-based method and a baseline LBP method for the task of active authentication. Lastly, we demonstrate the effectiveness of the proposed architecture in terms of speed and power consumption by deploying it on an actual mobile device.

preprint2016arXiv

DCNNs on a Diet: Sampling Strategies for Reducing the Training Set Size

Large-scale supervised classification algorithms, especially those based on deep convolutional neural networks (DCNNs), require vast amounts of training data to achieve state-of-the-art performance. Decreasing this data requirement would significantly speed up the training process and possibly improve generalization. Motivated by this objective, we consider the task of adaptively finding concise training subsets which will be iteratively presented to the learner. We use convex optimization methods, based on an objective criterion and feedback from the current performance of the classifier, to efficiently identify informative samples to train on. We propose an algorithm to decompose the optimization problem into smaller per-class problems, which can be solved in parallel. We test our approach on standard classification tasks and demonstrate its effectiveness in decreasing the training set size without compromising performance. We also show that our approach can make the classifier more robust in the presence of label noise and class imbalance.

preprint2016arXiv

Deep Feature-based Face Detection on Mobile Devices

We propose a deep feature-based face detector for mobile devices to detect the user's face acquired by the front-facing camera. The proposed method is able to detect faces in images containing extreme pose and illumination variations as well as partial faces. The main challenge in developing deep feature-based algorithms for mobile devices is the constrained nature of the mobile platform and the non-availability of CUDA-enabled GPUs on such devices. Our implementation takes into account the special nature of the images captured by the front-facing camera of mobile devices and exploits the GPUs present in mobile devices without CUDA-based frameworks, to meet these challenges.

preprint2016arXiv

Face Alignment by Local Deep Descriptor Regression

We present an algorithm for extracting key-point descriptors using deep convolutional neural networks (CNN). Unlike many existing deep CNNs, our model computes local features around a given point in an image. We also present a face alignment algorithm based on regression using these local descriptors. The proposed method called Local Deep Descriptor Regression (LDDR) is able to localize face landmarks of varying sizes, poses and occlusions with high accuracy. Deep Descriptors presented in this paper are able to uniquely and efficiently describe every pixel in the image and therefore can potentially replace traditional descriptors such as SIFT and HOG. Extensive evaluations on five publicly available unconstrained face alignment datasets show that our deep descriptor network is able to capture strong local features around a given landmark and performs significantly better than many competitive and state-of-the-art face alignment algorithms.

preprint2016arXiv

FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition

Relatively small data sets available for expression recognition research make the training of deep networks for expression recognition very challenging. Although fine-tuning can partially alleviate the issue, the performance is still below acceptable levels as the deep features probably contain redundant information from the pre-trained domain. In this paper, we present FaceNet2ExpNet, a novel idea to train an expression recognition network based on static images. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases, CK+, Oulu-CASIA, TFD, and SFEW demonstrate that our method achieves better results than state-of-the-art.

preprint2016arXiv

Optimized Kernel-based Projection Space of Riemannian Manifolds

It is proven that encoding images and videos through Symmetric Positive Definite (SPD) matrices, and considering the Riemannian geometry of the resulting space, can lead to increased classification performance. Taking into account manifold geometry is typically done via embedding the manifolds in tangent spaces, or Reproducing Kernel Hilbert Spaces (RKHS). Recently, it was shown that embedding such manifolds into a Random Projection Space (RPS), rather than RKHS or tangent space, leads to higher classification and clustering performance. However, based on structure and dimensionality of the randomly generated hyperplanes, the classification performance over RPS may vary significantly. In addition, fine-tuning RPS is data expensive (as it requires validation data), time consuming, and resource demanding. In this paper, we introduce an approach to learn an optimized kernel-based projection (with fixed dimensionality), by employing the concept of subspace clustering. As such, we encode the association of data points to the underlying subspace of each point, to generate meaningful hyperplanes. Further, we adopt the concepts of dictionary learning, sparse coding, and discriminative analysis for the optimized kernel-based projection space (OPS) on SPD manifolds. We validate our algorithm on several classification tasks. The experimental results also demonstrate that the proposed method outperforms state-of-the-art methods on such manifolds.

preprint2016arXiv

Towards the Design of an End-to-End Automated System for Image and Video-based Recognition

Over many decades, researchers working in object recognition have longed for an end-to-end automated system that will simply accept 2D or 3D images or videos as inputs and output the labels of objects in the input data. Computer vision methods that use representations derived based on geometric, radiometric and neural considerations and statistical and structural matchers and artificial neural network-based methods where a multi-layer network learns the mapping from inputs to class labels have provided competing approaches for image recognition problems. Over the last four years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements on object detection/recognition challenge problems. This has been made possible due to the availability of large annotated data, a better understanding of the non-linear mapping between image and class labels as well as the affordability of GPUs. In this paper, we present a brief history of developments in computer vision and artificial neural networks over the last forty years for the problem of image-based recognition. We then present the design details of a deep learning system for end-to-end unconstrained face verification/recognition. Some open issues regarding DCNNs for object recognition problems are then discussed. We caution the readers that the views expressed in this paper are from the authors and authors only!

preprint2016arXiv

Triplet Similarity Embedding for Face Verification

In this work, we present an unconstrained face verification algorithm and evaluate it on the recently released IJB-A dataset that aims to push the boundaries of face verification methods. The proposed algorithm couples a deep CNN-based approach with a low-dimensional discriminative embedding learnt using triplet similarity constraints in a large margin fashion. Aside from yielding performance improvement, this embedding provides significant advantages in terms of memory and post-processing operations like hashing and visualization. Experiments on the IJB-A dataset show that the proposed algorithm outperforms state of the art methods in verification and identification metrics, while requiring less training time.
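
The large-margin triplet constraint mentioned in the abstract can be sketched as a hinge loss. This is a minimal illustration and not the authors' implementation: it assumes a linear projection `W` applied to deep features and dot-product similarity, and the names `project`, `dot`, `triplet_hinge_loss` and the `margin` value are hypothetical.

```python
def project(W, x):
    """Apply a linear map W (given as a list of rows) to the feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    """Dot-product similarity between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def triplet_hinge_loss(W, anchor, positive, negative, margin=0.1):
    """Hinge loss for one triplet (assumed form, for illustration only).

    The loss is zero once the projected anchor is more similar to the
    positive than to the negative by at least `margin`.
    """
    a = project(W, anchor)
    p = project(W, positive)
    n = project(W, negative)
    return max(0.0, margin + dot(a, n) - dot(a, p))


# Hypothetical 3-D features projected down to 2-D.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
satisfied = triplet_hinge_loss(W, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
violated = triplet_hinge_loss(W, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0])
```

Because `W` has fewer rows than columns, the projected vectors are lower-dimensional than the input features, which is one plausible reading of the memory advantage the abstract mentions.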

preprint2016arXiv

Unconstrained Face Verification using Deep CNN Features

In this paper, we present an algorithm for unconstrained face verification based on deep convolutional features and evaluate it on the newly released IARPA Janus Benchmark A (IJB-A) dataset. The IJB-A dataset includes real-world unconstrained faces from 500 subjects with full pose and illumination variations which are much harder than the traditional Labeled Faces in the Wild (LFW) and YouTube Faces (YTF) datasets. The deep convolutional neural network (DCNN) is trained using the CASIA-WebFace dataset. Extensive experiments on the IJB-A dataset are provided.

preprint2015arXiv

A Deep Pyramid Deformable Part Model for Face Detection

We present a face detection algorithm based on Deformable Part Models and deep pyramidal features. The proposed method called DP2MFD is able to detect faces of various sizes and poses in unconstrained conditions. It reduces the gap in training and testing of DPM on deep features by adding a normalization layer to the deep convolutional neural network (CNN). Extensive experiments on four publicly available unconstrained face detection datasets show that our method is able to capture the meaningful structure of faces and performs significantly better than many competitive face detection algorithms.

preprint2015arXiv

Compositional Dictionaries for Domain Adaptive Face Recognition

We present a dictionary learning approach to compensate for the transformation of faces due to changes in view point, illumination, resolution, etc. The key idea of our approach is to force domain-invariant sparse coding, i.e., design a consistent sparse representation of the same face in different domains. In this way, classifiers trained on the sparse codes in the source domain consisting of frontal faces for example can be applied to the target domain (consisting of faces in different poses, illumination conditions, etc) without much loss in recognition accuracy. The approach is to first learn a domain base dictionary, and then describe each domain shift (identity, pose, illumination) using a sparse representation over the base dictionary. The dictionary adapted to each domain is expressed as sparse linear combinations of the base dictionary. In the context of face recognition, with the proposed compositional dictionary approach, a face image can be decomposed into sparse representations for a given subject, pose and illumination respectively. This approach has three advantages: first, the extracted sparse representation for a subject is consistent across domains and enables pose and illumination insensitive face recognition. Second, sparse representations for pose and illumination can subsequently be used to estimate the pose and illumination condition of a face image. Finally, by composing sparse representations for subject and the different domains, we can also perform pose alignment and illumination normalization. Extensive experiments using two public face datasets are presented to demonstrate the effectiveness of our approach for face recognition.

preprint2015arXiv

Deep Multi-task Learning for Railway Track Inspection

Railroad tracks need to be periodically inspected and monitored to ensure safe transportation. Automated track inspection using computer vision and pattern recognition methods has recently shown the potential to improve safety by allowing for more frequent inspections while reducing human errors. Achieving full automation is still very challenging due to the number of different possible failure modes as well as the broad range of image variations that can potentially trigger false alarms. Also, the number of defective components is very small, so not many training examples are available for the machine to learn a robust anomaly detector. In this paper, we show that detection performance can be improved by combining multiple detectors within a multi-task learning framework. We show that this approach results in better accuracy in detecting defects on railway ties and fasteners.

preprint2015arXiv

Sequential Score Adaptation with Extreme Value Theory for Robust Railway Track Inspection

Periodic inspections are necessary to keep railroad tracks in a state of good repair and prevent train accidents. Automatic track inspection using machine vision technology has become a very effective inspection tool. Because of its non-contact nature, this technology can be deployed on virtually any railway vehicle to continuously survey the tracks and send exception reports to track maintenance personnel. However, as appearance and imaging conditions vary, false alarm rates can dramatically change, making it difficult to select a good operating point. In this paper, we use extreme value theory (EVT) within a Bayesian framework to optimally adjust the sensitivity of anomaly detectors. We show that by approximating the lower tail of the probability density function (PDF) of the scores with an Exponential distribution (a special case of the Generalized Pareto distribution), and using the Gamma conjugate prior learned from the training data, it is possible to reduce the variability in false alarm rate and improve the overall performance. This method has shown an increase in the defect detection rate of rail fasteners in the presence of clutter (at PFA 0.1%) from 95.40% to 99.26% on the 85-mile Northeast Corridor (NEC) 2012-2013 concrete tie dataset.
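
The Exponential tail with a Gamma conjugate prior has a closed-form update, sketched here. This is the textbook conjugate computation, not the paper's sequential adaptation procedure: it assumes the tail is summarized by non-negative excesses beyond a threshold, and the function names and prior values are hypothetical.

```python
def gamma_posterior(alpha, beta, excesses):
    """Update a Gamma(shape=alpha, rate=beta) prior on the Exponential rate.

    `excesses` are non-negative distances of tail scores beyond a chosen
    threshold (an assumption of this sketch). Conjugacy gives
    Gamma(alpha + n, beta + sum(excesses)).
    """
    return alpha + len(excesses), beta + sum(excesses)


def predictive_tail_probability(alpha, beta, x):
    """P(excess > x) after integrating the rate out under Gamma(alpha, beta).

    The Exponential survival function exp(-rate * x), averaged over the
    Gamma distribution, equals (beta / (beta + x)) ** alpha.
    """
    return (beta / (beta + x)) ** alpha


# Hypothetical prior and observations, for illustration only.
post_alpha, post_beta = gamma_posterior(2.0, 1.0, [0.5, 1.5])
tail = predictive_tail_probability(post_alpha, post_beta, 3.0)
```

A detector threshold could then be chosen as the excess `x` at which the predictive tail probability drops to a target false alarm rate. How the paper actually selects its operating point is not stated in the abstract.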

preprint2014arXiv

Growing Regression Forests by Classification: Applications to Object Pose Estimation

In this work, we propose a novel node splitting method for regression trees and incorporate it into the regression forest framework. Unlike traditional binary splitting, where the splitting rule is selected from a predefined set of binary splitting rules via trial-and-error, the proposed node splitting method first finds clusters of the training data which at least locally minimize the empirical loss without considering the input space. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. Consequently, our new node splitting method enjoys more freedom in choosing the splitting rules, resulting in more efficient tree structures. In addition to the Euclidean target space, we present a variant which can naturally deal with a circular target space by the proper use of circular statistics. We apply the regression forest employing our node splitting to head pose estimation (Euclidean target space) and car direction estimation (circular target space) and demonstrate that the proposed method significantly outperforms state-of-the-art methods (38.5% and 22.5% error reduction respectively).
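
For the circular target space, the key point is that angles cannot be averaged arithmetically. The following sketch uses the standard circular mean from directional statistics. It is not the paper's node-splitting method, and the function name is hypothetical.

```python
import math


def circular_mean(angles):
    """Mean direction of angles given in radians.

    Each angle is mapped to a unit vector, the vectors are averaged, and
    the direction of the average is returned in the range (-pi, pi].
    """
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    return math.atan2(mean_sin, mean_cos)


# Two car headings on either side of 0 degrees.
headings = [math.radians(350.0), math.radians(10.0)]
naive = sum(headings) / len(headings)   # arithmetic mean, equal to 180 degrees
circular = circular_mean(headings)      # close to 0 degrees
```

The arithmetic mean points in the opposite direction from both headings, while the circular mean stays between them, which is why a regression target such as car direction needs circular statistics.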

preprint2014arXiv

MKL-RT: Multiple Kernel Learning for Ratio-trace Problems via Convex Optimization

In the recent past, automatic selection or combination of kernels (or features) based on multiple kernel learning (MKL) approaches has been receiving significant attention from various research communities. Though MKL has been extensively studied in the context of support vector machines (SVM), it is relatively less explored for ratio-trace problems. In this paper, we show that MKL can be formulated as a convex optimization problem for a general class of ratio-trace problems that encompasses many popular algorithms used in various computer vision applications. We also provide an optimization procedure that is guaranteed to converge to the global optimum of the proposed optimization problem. We experimentally demonstrate that the proposed MKL approach, which we refer to as MKL-RT, can be successfully used to select features for discriminative dimensionality reduction and cross-modal retrieval. We also show that the proposed convex MKL-RT approach performs better than the recently proposed non-convex MKL-DR approach.

preprint2013arXiv

Compressive Acquisition of Dynamic Scenes

Compressive sensing (CS) is a new approach for the acquisition and recovery of sparse signals and images that enables sampling rates significantly below the classical Nyquist rate. Despite significant progress in the theory and methods of CS, little headway has been made in compressive video acquisition and recovery. Video CS is complicated by the ephemeral nature of dynamic events, which makes direct extensions of standard CS imaging architectures and signal models difficult. In this paper, we develop a new framework for video CS for dynamic textured scenes that models the evolution of the scene as a linear dynamical system (LDS). This reduces the video recovery problem to first estimating the model parameters of the LDS from compressive measurements, and then reconstructing the image frames. We exploit the low-dimensional dynamic parameters (the state sequence) and high-dimensional static parameters (the observation matrix) of the LDS to devise a novel compressive measurement strategy that measures only the dynamic part of the scene at each instant and accumulates measurements over time to estimate the static parameters. This enables us to lower the compressive measurement rate considerably. We validate our approach with a range of experiments involving video recovery, sensing hyper-spectral data, and classification of dynamic scenes from compressive data. Together, these applications demonstrate the effectiveness of the approach.

preprint2013arXiv

Sparse Dictionary-based Attributes for Action Recognition and Summarization

We present an approach for dictionary learning of action attributes via information maximization. We unify the class distribution and appearance information into an objective function for learning a sparse dictionary of action attributes. The objective function maximizes the mutual information between what has been learned and what remains to be learned in terms of appearance information and class distribution for each dictionary atom. We propose a Gaussian Process (GP) model for sparse representation to optimize the dictionary objective function. The sparse coding property allows a kernel with compact support in GP to realize a very efficient dictionary learning process. Hence we can describe an action video by a set of compact and discriminative action attributes. More importantly, we can recognize modeled action categories in a sparse feature space, which can be generalized to unseen and unmodeled action categories. Experimental results demonstrate the effectiveness of our approach in action recognition and summarization.

preprint2012arXiv

A Unified Approach for Modeling and Recognition of Individual Actions and Group Activities

Recognizing group activities is challenging due to the difficulties in isolating individual entities, finding the respective roles played by the individuals and representing the complex interactions among the participants. Individual actions and group activities in videos can be represented in a common framework as they share the following common feature: both are composed of a set of low-level features describing motions, e.g., optical flow for each pixel or a trajectory for each feature point, according to a set of composition constraints in both temporal and spatial dimensions. In this paper, we present a unified model to assess the similarity between two given individual or group activities. Our approach avoids explicit extraction of individual actors, identifying and representing the inter-person interactions. With the proposed approach, retrieval from a video database can be performed through Query-by-Example; and activities can be recognized by querying videos containing known activities. The suggested video matching process can be performed in an unsupervised manner. We demonstrate the performance of our approach by recognizing a set of human actions and football plays.

preprint2012arXiv

Information-theoretic Dictionary Learning for Image Classification

We present a two-stage approach for learning dictionaries for object classification tasks based on the principle of information maximization. The proposed method seeks a dictionary that is compact, discriminative, and generative. In the first stage, dictionary atoms are selected from an initial dictionary by maximizing the mutual information measure on dictionary compactness, discrimination and reconstruction. In the second stage, the selected dictionary atoms are updated for improved reconstructive and discriminative power using a simple gradient ascent algorithm on mutual information. Experiments using real datasets demonstrate the effectiveness of our approach for image classification tasks.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Computer Vision Machine Learning eess.IV Artificial Intelligence Cryptography and Security Databases Information Theory math.IT Robotics

Source provenance

Where this author record came from

arXiv, confidence 95%

external id: arxiv:2605.09355:author:4:rama-chellappa

Imported May 20, 2026. Synced May 21, 2026.

arXiv, confidence 95%

external id: arxiv:2604.27218:author:5:rama-chellappa

Imported May 20, 2026. Synced May 20, 2026.

arXiv, confidence 95%

external id: arxiv:2605.06876:author:4:rama-chellappa

Imported May 20, 2026. Synced May 20, 2026.

7 works

Cheng Peng

Researcher


7 works

Vishal M. Patel

Researcher


4 works

Carlos D. Castillo

Researcher


4 works

Chun Pong Lau

Researcher

