Source author record

Saptarshi Chakraborty

Saptarshi Chakraborty appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Methodology math.ST Statistics Theory Applications Computation Computer Vision

Catalog footprint

What is connected

8works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Bregman Power k-Means for Clustering Exponential Family Data

Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.

preprint2022arXiv

Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification

The problem of linear predictions has been extensively studied for the past century under pretty generalized frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $ε$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+ε)$, matching the best-known rates in literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called "fast rates" under additional assumptions.

preprint2022arXiv

Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

Statistical inference on the cancer-site specificities of collective ultra-rare whole genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data due to their ultra-high-dimensionality and extreme data sparsity -- e.g., >30 million unique variants are observed in the ~1700 whole-genome tumor dataset considered herein, of which >99% variants are encountered only once. To harness information in these rare variants we have recently proposed the "hidden genome model", a formal multilevel multi-logistic model that mines information in ultra-rare somatic variants to characterize tumor types. The model condenses signals in rare variants through a hierarchical layer leveraging contexts of individual mutations. The model is currently implemented using consistent, scalable point estimation techniques that can handle 10s of millions of variants detected across thousands of tumors. Our recent publications have evidenced its impressive accuracy and attributability at scale. However, principled statistical inference from the model is infeasible due to the volume, correlation, and non-interpretability of the mutation contexts. In this paper we propose a novel framework that leverages topic models from the field of computational linguistics to induce an *interpretable dimension reduction* of the mutation contexts used in the model. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We employ our model on the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.

preprint2020arXiv

Entropy Regularized Power k-Means Clustering

Despite its well-known shortcomings, $k$-means remains one of the most widely used approaches to data clustering. Current research continues to tackle its flaws while attempting to preserve its simplicity. Recently, the \textit{power $k$-means} algorithm was proposed to avoid trapping in local minima by annealing through a family of smoother surfaces. However, the approach lacks theoretical justification and fails in high dimensions when many features are irrelevant. This paper addresses these issues by introducing \textit{entropy regularization} to learn feature relevance while annealing. We prove consistency of the proposed approach and derive a scalable majorization-minimization algorithm that enjoys closed-form updates and convergence guarantees. In particular, our method retains the same computational complexity of $k$-means and power $k$-means, but yields significant improvements over both. Its merits are thoroughly assessed on a suite of real and synthetic data experiments.

preprint2020arXiv

On the circular correlation coefficients for bivariate von Mises distributions on a torus

This paper studies circular correlations for the bivariate von Mises sine and cosine distributions. These are two simple and appealing models for bivariate angular data with five parameters each that have interpretations comparable to those in the ordinary bivariate normal model. However, the variability and association of the angle pairs cannot be easily deduced from the model parameters unlike the bivariate normal. Thus to compute such summary measures, tools from circular statistics are needed. We derive analytic expressions and study the properties of the Jammalamadaka-Sarma and Fisher-Lee circular correlation coefficients for the von Mises sine and cosine models. Likelihood-based inference of these coefficients from sample data is then presented. The correlation coefficients are illustrated with numerical and visual examples, and the maximum likelihood estimators are assessed on simulated and real data, with comparisons to their non-parametric counterparts. Implementations of these computations for practical use are provided in our R package BAMBI.

preprint2020arXiv

Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension reduction & clustering

Even with the rise in popularity of over-parameterized models, simple dimensionality reduction and clustering methods, such as PCA and k-means, are still routinely used in an amazing variety of settings. A primary reason is the combination of simplicity, interpretability and computational efficiency. The focus of this article is on improving upon PCA and k-means, by allowing non-linear relations in the data and more flexible cluster shapes, without sacrificing the key advantages. The key contribution is a new framework for Principal Elliptical Analysis (PEA), defining a simple and computationally efficient alternative to PCA that fits the best elliptical approximation through the data. We provide theoretical guarantees on the proposed PEA algorithm using Vapnik-Chervonenkis (VC) theory to show strong consistency and uniform concentration bounds. Toy experiments illustrate the performance of PEA, and the ability to adapt to non-linear structure and complex cluster shapes. In a rich variety of real data clustering applications, PEA is shown to do as well as k-means for simple datasets, while dramatically improving performance in more complex settings.

preprint2020arXiv

Using the "Hidden" Genome to Improve Classification of Cancer Types

It is increasingly common clinically for cancer specimens to be examined using techniques that identify somatic mutations. In principle these mutational profiles can be used to diagnose the tissue of origin, a critical task for the 3-5% of tumors that have an unknown primary site. Diagnosis of primary site is also critical for screening tests that employ circulating DNA. However, most mutations observed in any new tumor are very rarely occurring mutations, and indeed the preponderance of these may never have been observed in any previous recorded tumor. To create a viable diagnostic tool we need to harness the information content in this "hidden genome" of variants for which no direct information is available. To accomplish this we propose a multi-level meta-feature regression to extract the critical information from rare variants in the training data in a way that permits us to also extract diagnostic information from any previously unobserved variants in the new tumor sample. A scalable implementation of the model is obtained by combining a high-dimensional feature screening approach with a group-lasso penalized maximum likelihood approach based on an equivalent mixed-effect representation of the multilevel model. We apply the method to the Cancer Genome Atlas whole-exome sequencing data set including 3702 tumor samples across 7 common cancer sites. Results show that our multi-level approach can harness substantial diagnostic information from the hidden genome.

preprint2014arXiv

An Overview of Face Liveness Detection

Face recognition is a widely used biometric approach. Face recognition technology has developed rapidly in recent years and it is more direct, user friendly and convenient compared to other methods. But face recognition systems are vulnerable to spoof attacks made by non-real faces. It is an easy way to spoof face recognition systems by facial pictures such as portrait photographs. A secure system needs Liveness detection in order to guard against such spoofing. In this work, face liveness detection approaches are categorized based on the various types techniques used for liveness detection. This categorization helps understanding different spoof attacks scenarios and their relation to the developed solutions. A review of the latest works regarding face liveness detection works is presented. The main aim is to provide a simple path for the future development of novel and more secured face liveness detection approach.

Saptarshi Chakraborty

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Bregman Power k-Means for Clustering Exponential Family Data

Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification

Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

Entropy Regularized Power k-Means Clustering

On the circular correlation coefficients for bivariate von Mises distributions on a torus

Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension reduction & clustering

Using the "Hidden" Genome to Improve Classification of Cancer Types

An Overview of Face Liveness Detection