Source author record

Anshuman Tripathi

Anshuman Tripathi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

eess.AS, Machine Learning, Sound, astro-ph.CO, Computation and Language

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint, 2025, arXiv

Exploring Machine Learning Regression Models for Advancing Foreground Mitigation and Global 21cm Signal Parameter Extraction

Extracting parameters from the global 21cm signal is crucial for understanding the early Universe. However, detecting the 21cm signal is challenging due to the brighter foreground and associated observational difficulties. In this study, we evaluate the performance of various machine-learning regression models to improve parameter extraction and foreground removal. This evaluation is essential for selecting the most suitable machine learning regression model based on computational efficiency and predictive accuracy. We compare four models: Random Forest Regressor (RFR), Gaussian Process Regressor (GPR), Support Vector Regressor (SVR), and Artificial Neural Networks (ANN). The comparison is based on metrics such as the root mean square error (RMSE) and $R^2$ scores. We examine their effectiveness across different dataset sizes and conditions, including scenarios with foreground contamination. Our results indicate that ANN consistently outperforms the other models, achieving the lowest RMSE and the highest $R^2$ scores across multiple cases. While GPR also performs well, it is computationally intensive, requiring significant RAM and longer execution times. SVR struggles with large datasets due to its high computational costs, and RFR demonstrates the weakest accuracy among the models tested. We also found that employing Principal Component Analysis (PCA) as a preprocessing step significantly enhances model performance, especially in the presence of foregrounds.
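The two metrics named in the abstract above, RMSE and $R^2$, can be written out directly. The following is a minimal sketch using only the Python standard library; the function names and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r2_score(y_true, y_pred))
```

A lower RMSE and an $R^2$ closer to 1 both indicate a better fit, which is the sense in which the abstract ranks the four regressors.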

preprint, 2022, arXiv

Contrastive Siamese Network for Semi-supervised Speech Recognition

This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, and (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam and other best-performing systems. Our experiments show that c-siam provides a 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to the state-of-the-art networks with 600M parameters.
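To make "matching outputs with a contrastive loss" concrete, here is a minimal sketch of a generic InfoNCE-style loss between two branches. It is an assumption chosen for illustration, not the loss defined in the paper: the function names, the cosine similarity, and the temperature value are all hypothetical, and the stop-gradient is only noted in a comment because plain Python has no gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(augmented, target, temperature=0.1):
    """Mean InfoNCE loss over frames.

    Frame i of `augmented` should match frame i of `target`; all other
    target frames act as negatives. In a training framework, `target`
    would be treated as a constant (stop gradient); that is not modeled here.
    """
    total = 0.0
    for i, a in enumerate(augmented):
        logits = [cosine(a, t) / temperature for t in target]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(augmented)


# Hypothetical two-frame outputs, for illustration only.
target = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(aligned, target) < contrastive_loss(swapped, target))
```

The loss is small when each augmented frame is most similar to its own target frame and large when it is more similar to a different frame.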

preprint, 2022, arXiv

Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection

In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system greatly reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems, which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which greatly reduces the human effort involved in data collection.
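One way to picture "constraints from the detected speaker turns" is as a pairwise constraint matrix over consecutive segments. The sketch below is an assumption for illustration only: the abstract does not specify the constraint form, and the function name, the +1/-1 encoding, and the input format are hypothetical.

```python
def build_constraints(turn_between):
    """Build a symmetric constraint matrix over consecutive segments.

    `turn_between[i]` is True if a speaker turn was detected between
    segment i and segment i + 1.
    Encoding (hypothetical): -1 = cannot-link, +1 = must-link, 0 = no constraint.
    """
    n = len(turn_between) + 1
    q = [[0] * n for _ in range(n)]
    for i, is_turn in enumerate(turn_between):
        value = -1 if is_turn else 1
        q[i][i + 1] = value
        q[i + 1][i] = value
    return q


# Three segments: a turn between the first two, none between the last two.
for row in build_constraints([True, False]):
    print(row)
```

A clustering step could then use such a matrix to discourage merging segments separated by a turn; how the constraints enter the clustering objective is left open here.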

preprint, 2020, arXiv

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
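Limiting the left context, and optionally allowing a few future frames, amounts to a banded self-attention mask. The following is a minimal sketch of such a mask; the function name and parameters are hypothetical and the window sizes are illustrative, not the settings used in the paper.

```python
def attention_mask(num_frames, left_context, right_context=0):
    """Return a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Frame i sees at most `left_context` past frames, itself, and at most
    `right_context` future frames.
    """
    return [
        [(-left_context <= j - i <= right_context) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Strictly streaming: one past frame, no look-ahead.
for row in attention_mask(4, left_context=1):
    print(["x" if allowed else "." for allowed in row])

# Small look-ahead: one past frame and one future frame.
for row in attention_mask(4, left_context=1, right_context=1):
    print(["x" if allowed else "." for allowed in row])
```

With a fixed window, the number of attended frames per position stays constant as the utterance grows, which is what makes streaming decoding tractable; a non-zero `right_context` trades some latency for accuracy.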