Source author record

Rita Singh

Rita Singh appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence, Machine Learning, Sound, eess.AS, Computer Vision, Computation and Language, cond-mat.mtrl-sci, Cryptography and Security, Multimedia

Catalog footprint

What is connected

10 works

9 topics

4 close collaborators

preprint, 2026, arXiv

VerLM: Explaining Face Verification Using Natural Language

Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model's accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision-language models in the face verification setting, contributing to more transparent, reliable, and explainable face verification systems.

preprint, 2022, arXiv

Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction

This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
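The abstract does not say which fusion rule was used, so the following is only a minimal sketch of one common form of score fusion: a weighted average of the per-class scores produced by two models. The function name `fuse_scores`, the equal default weights, and the example score values are all hypothetical and are not taken from the paper.

```python
def fuse_scores(score_sets, weights=None):
    """Weighted average of per-model score vectors (an assumed fusion rule)."""
    if not score_sets:
        raise ValueError("need at least one score vector")
    n = len(score_sets[0])
    if any(len(s) != n for s in score_sets):
        raise ValueError("all score vectors must have the same length")
    if weights is None:
        # Assumption: every model contributes equally unless told otherwise.
        weights = [1.0] * len(score_sets)
    if len(weights) != len(score_sets):
        raise ValueError("need exactly one weight per score vector")
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, score_sets)) / total
        for i in range(n)
    ]


# Made-up class scores from two feature sets for one utterance.
strf_scores = [0.2, 0.6, 0.2]
hubert_scores = [0.4, 0.4, 0.2]
fused = fuse_scores([strf_scores, hubert_scores])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights the fused vector is the plain mean of the two inputs; unequal weights would let the stronger feature set dominate.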

preprint, 2022, arXiv

SphereFace Revived: Unifying Hyperspherical Face Recognition

This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attention and gradually become a major focus in face recognition research. As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin. However, SphereFace still suffers from severe training instability which limits its application in practice. In order to address this problem, we introduce a unified framework to understand large angular margin in hyperspherical face recognition. Under this framework, we extend the study of SphereFace and propose an improved variant with substantially better training stability -- SphereFace-R. Specifically, we propose two novel ways to implement the multiplicative margin, and study SphereFace-R under three different feature normalization schemes (no feature normalization, hard feature normalization and soft feature normalization). We also propose an implementation strategy -- "characteristic gradient detachment" -- to stabilize training. Extensive experiments on SphereFace-R show that it is consistently better than or competitive with state-of-the-art methods.
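The open-set criterion stated at the start of this abstract (maximal intra-class distance smaller than minimal inter-class distance) can be checked directly on a small set of embeddings. The sketch below assumes angular distance as the metric, which fits the hyperspherical setting but is a choice made here for illustration; the function names and the toy 2-D embeddings are hypothetical.

```python
import math
from itertools import combinations


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.acos(cosine)


def satisfies_open_set_criterion(embeddings, labels):
    """True if the largest same-class distance is below the smallest
    different-class distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        d = angular_distance(embeddings[i], embeddings[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need at least one same-class and one cross-class pair")
    return max(intra) < min(inter)


# Toy embeddings: two identities, two samples each.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
well_separated = satisfies_open_set_criterion(embeddings, ["a", "a", "b", "b"])
mixed_up = satisfies_open_set_criterion(embeddings, ["a", "b", "a", "b"])
```

A large angular margin during training is one way to push embeddings toward satisfying this condition; the check itself says nothing about how the embeddings were learned.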

preprint, 2022, arXiv

SphereFace2: Binary Classification is All You Need for Deep Face Recognition

State-of-the-art deep face recognition methods are mostly trained with a softmax-based multi-class classification framework. Despite being popular and effective, these methods still have a few shortcomings that limit empirical performance. In this paper, we start by identifying the discrepancy between training and evaluation in the existing multi-class classification framework and then discuss the potential limitations caused by the "competitive" nature of softmax normalization. Motivated by these limitations, we propose a novel binary classification training framework, termed SphereFace2. In contrast to existing methods, SphereFace2 circumvents the softmax normalization, as well as the corresponding closed-set assumption. This effectively bridges the gap between training and evaluation, enabling the representations to be improved individually by each binary classification task. Besides designing a specific well-performing loss function, we summarize a few general principles for this "one-vs-all" binary classification framework so that it can outperform current competitive methods. Our experiments on popular benchmarks demonstrate that SphereFace2 can consistently outperform state-of-the-art deep face recognition methods. The code has been made publicly available.

preprint, 2020, arXiv

Hide and Speak: Towards Deep Neural Networks for Speech Steganography

Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. We show that steganography models proposed for vision are less suitable for speech, and propose a new model that includes the short-time Fourier transform and inverse short-time Fourier transform as differentiable layers within the network, thus imposing a vital constraint on the network outputs. We empirically demonstrate the effectiveness of the proposed method compared with deep-learning-based approaches on several speech datasets and analyze the results quantitatively and qualitatively. Moreover, we show that the proposed approach can be applied to conceal multiple messages in a single carrier using multiple decoders or a single conditional decoder. Lastly, we evaluate our model under different channel distortions. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible.
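For contrast with the neural approach, the traditional least significant bit technique mentioned in the abstract can be sketched in a few lines. This illustrates the classical baseline only, not the paper's model; it assumes the carrier is a list of integer PCM samples, and the function names and sample values are hypothetical.

```python
def lsb_embed(carrier, bits):
    """Overwrite the lowest bit of the first len(bits) samples with the message."""
    if len(bits) > len(carrier):
        raise ValueError("message is longer than the carrier")
    if any(b not in (0, 1) for b in bits):
        raise ValueError("bits must be 0 or 1")
    stego = list(carrier)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the lowest bit, then set it
    return stego


def lsb_extract(stego, n_bits):
    """Read the message back from the lowest bit of each sample."""
    return [sample & 1 for sample in stego[:n_bits]]


carrier = [1000, -2001, 37, 512]  # made-up integer audio samples
message = [1, 0, 1, 1]
stego = lsb_embed(carrier, message)
recovered = lsb_extract(stego, len(message))
max_change = max(abs(a - b) for a, b in zip(carrier, stego))
```

Each sample changes by at most one quantization step, which is why the modification is hard to hear, but the hidden bits do not survive lossy channels or re-encoding.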

preprint, 2020, arXiv

Speech-Based Parameter Estimation of an Asymmetric Vocal Fold Oscillation Model and Its Application in Discriminating Vocal Fold Pathologies

So far, several physical models have been proposed for the study of vocal fold oscillations during phonation. The parameters of these models, such as vocal fold elasticity, resistance, etc., are traditionally determined through the observation and measurement of the vocal fold vibrations in the larynx. Since such direct measurements tend to be the most accurate, the traditional practice has been to set the parameter values of these models based on measurements that are averaged across an ensemble of human subjects. However, the direct measurement process is hard to revise outside of clinical settings. In many cases, especially pathological ones, the properties of the vocal folds deviate from their generic values, sometimes asymmetrically, in that the characteristics of the two vocal folds differ for the same individual. In such cases, it is desirable to find a more scalable way to adjust the model parameters on a case-by-case basis. In this paper, we present a novel alternative way to determine vocal fold model parameters from the speech signal. We focus on an asymmetric model and show that for such models, differences in estimated parameters can be successfully used to discriminate between voices that are characteristic of different underlying vocal fold pathologies.

preprint, 2020, arXiv

The phonetic bases of vocal expressed emotion: natural versus acted

Can vocal emotions be emulated? This question has been a recurrent concern of the speech community, and has also been vigorously investigated. It has been fueled further by its link to the issue of validity of acted emotion databases. Much of the speech and vocal emotion research has relied on acted emotion databases as valid proxies for studying natural emotions. To create models that generalize to natural settings, it is crucial to work with valid prototypes -- ones that can be assumed to reliably represent natural emotions. More concretely, it is important to study emulated emotions against natural emotions in terms of their physiological and psychological concomitants. In this paper, we present an on-scale systematic study of the differences between natural and acted vocal emotions. We use a self-attention based emotion classification model to understand the phonetic bases of emotions by discovering the most 'attended' phonemes for each class of emotions. We then compare these attended phonemes in their importance and distribution across acted and natural classes. Our tests show significant differences in the manner and choice of phonemes in acted and natural speech, indicating moderate to low validity and value in using acted speech databases for emotion classification tasks.

preprint, 2019, arXiv

Neural Regression Trees

Regression-via-Classification (RvC) is the process of converting a regression problem to a classification one. Current approaches for RvC use ad-hoc discretization strategies and are suboptimal. We propose a neural regression tree model for RvC. In this model, we employ a joint optimization framework where we learn optimal discretization thresholds while simultaneously optimizing the features for each node in the tree. We empirically show the validity of our model by testing it on two challenging regression tasks where we establish the state of the art.
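To make the RvC idea concrete, the sketch below shows the fixed-threshold variant that the abstract describes as ad hoc: continuous targets are mapped to bin indices, a classifier would then be trained on those indices, and a predicted bin is decoded back to a number. It does not implement the neural regression tree, which learns its thresholds jointly with the features. Decoding by the mean of the training targets in each bin is an assumption made here, and the thresholds and target values are made up.

```python
from bisect import bisect_right


def to_class(y, thresholds):
    """Index of the bin containing y; thresholds must be sorted ascending."""
    return bisect_right(thresholds, y)


def bin_means(targets, thresholds):
    """Mean training target per bin, used to decode a predicted class
    back into a regression value (None for bins with no samples)."""
    n_bins = len(thresholds) + 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for y in targets:
        k = to_class(y, thresholds)
        sums[k] += y
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


thresholds = [30.0, 60.0]  # hand-picked, i.e. exactly the ad hoc choice
targets = [12.0, 25.0, 41.0, 55.0, 70.0, 90.0]
classes = [to_class(y, thresholds) for y in targets]
representatives = bin_means(targets, thresholds)

# If a classifier predicted bin 1 for a new input, the decoded value would be:
decoded = representatives[1]
```

The quality of the decoded value depends entirely on where the thresholds sit, which is the motivation the abstract gives for learning them.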

preprint, 2015, arXiv

Plagiarism Detection in Polyphonic Music using Monaural Signal Separation

Given the large number of new musical tracks released each year, automated approaches to plagiarism detection are essential to help us track potential violations of copyright. Most current approaches to plagiarism detection are based on musical similarity measures, which typically ignore the issue of polyphony in music. We present a novel feature space for audio derived from compositional modelling techniques, commonly used in signal separation, that provides a mechanism to account for polyphony without incurring an inordinate amount of computational overhead. We employ this feature representation in conjunction with traditional audio feature representations in a classification framework which uses an ensemble of distance features to characterize pairs of songs as being plagiarized or not. Our experiments on a database of about 3000 musical track pairs show that the new feature space characterization produces significant improvements over standard baselines.

preprint, 2011, arXiv

Synthesis and structural/microstructural characteristics of antimony doped tin oxide $(Sn_{1-x}Sb_{x}O_{2-δ})$

Bulk samples of $(Sn_{1-x}Sb_{x}O_{2-δ})$ with x = 0.00, 0.10, 0.20, 0.30 were synthesized by the solid-state reaction route. Samples were characterized by X-ray powder diffraction (XRD), scanning electron microscopy (SEM), transmission electron microscopy (TEM) and UV-Vis spectroscopy. The X-ray diffraction patterns indicate that the gross structure/phase of $(Sn_{1-x}Sb_{x}O_{2-δ})$ does not change with the substitution of antimony (Sb) up to x = 0.30. The surface morphological examination with SEM revealed that the grain size in the antimony-doped samples is larger than that of the undoped one, and hence pores/voids between the grains increase with Sb concentration up to 0.30. The TEM image of the undoped sample indicates that the $SnO_{2}$ grains have diameters ranging from 25 to 120 nm and that most grains are cubic or spherical in shape. As antimony content increases, the nanocubes/spheres are converted into microcubes/spheres. The reflectance of the $Sn_{1-x}Sb_{x}O_{2-δ}$ samples increases, whereas the absorbance of these samples decreases, with increasing concentration of antimony (Sb) for the wavelength range 360 - 800 nm. The energy bandgap of the Sb-doped $SnO_{2}$ samples was obtained from optical absorption spectra by UV-Vis absorption spectroscopy. Upon increasing the Sb concentration, the bandgap of the samples was found to increase from 3.367 eV to 3.558 eV.
