Source author record

Anirban DasGupta

Anirban DasGupta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision · Data Structures and Algorithms · Machine Learning · Artificial Intelligence · Computer Science and Game Theory · math.ST · Methodology · physics.soc-ph · Social and Information Networks · Statistics Theory

Catalog footprint

What is connected

16 works

10 topics

4 close collaborators

preprint · 2022 · arXiv

Valid confidence intervals for $μ, σ$ when there is only one observation available

Portnoy (2019) considered the problem of constructing an optimal confidence interval for the mean based on a single observation $X \sim \mathcal{N}(μ, σ^2)$. Here we extend this result to obtaining 1-sample confidence intervals for $σ$ and to cases of symmetric unimodal distributions and of distributions with compact support. Finally, we extend the multivariate result in Portnoy (2019) to allow a sample of size $m$ from a multivariate normal distribution where $m$ may be less than the dimension.

preprint · 2020 · arXiv

Efficient Hierarchical Clustering for Classification and Anomaly Detection

We address the problem of large-scale real-time classification of content posted on social networks, along with the need to rapidly identify novel spam types. Obtaining manual labels for user-generated content through editorial labeling and taxonomy development lags behind the rate at which new content types need to be classified. We propose a class of hierarchical clustering algorithms that can be used both for efficient and scalable real-time multiclass classification and for detecting new anomalies in user-generated content. Our methods have low query time and linear space usage, and come with theoretical guarantees with respect to a specific hierarchical clustering cost function (Dasgupta, 2016). We compare our solutions against a range of classification techniques and demonstrate excellent empirical performance.

preprint · 2020 · arXiv

On Coresets For Regularized Regression

We study the effect of norm-based regularization on the size of coresets for regression problems. Specifically, given a matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$ with $n \gg d$, a vector $\mathbf{b} \in \mathbb{R}^n$ and $λ > 0$, we analyze the size of coresets for regularized versions of regression of the form $\|\mathbf{Ax}-\mathbf{b}\|_p^r + λ\|\mathbf{x}\|_q^s$. Prior work has shown that for ridge regression (where $p,q,r,s=2$) we can obtain a coreset that is smaller than the coreset for the unregularized counterpart, i.e. least squares regression (Avron et al.). We show that when $r \neq s$, no coreset for regularized regression can have size smaller than the optimal coreset of the unregularized version. The well-known lasso problem falls under this category and hence does not allow a coreset smaller than the one for least squares regression. We propose a modified version of the lasso problem and obtain for it a coreset smaller than the one for least squares regression. We empirically show that the modified version of lasso also induces sparsity in the solution, similar to the original lasso. We also obtain smaller coresets for $\ell_p$ regression with $\ell_p$ regularization. We extend our methods to multi-response regularized regression. Finally, we empirically demonstrate the coreset performance for the modified lasso and for $\ell_1$ regression with $\ell_1$ regularization.
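As a worked instance of the general objective in the abstract above, the two named special cases can be written out under the standard textbook definitions of ridge and lasso (the exponent assignments for lasso are our reading of those definitions, not a quotation from the paper):

$$\text{ridge:}\quad \|\mathbf{Ax}-\mathbf{b}\|_2^2 + λ\|\mathbf{x}\|_2^2 \qquad (p = q = 2,\ r = s = 2)$$

$$\text{lasso:}\quad \|\mathbf{Ax}-\mathbf{b}\|_2^2 + λ\|\mathbf{x}\|_1 \qquad (p = 2,\ r = 2,\ q = 1,\ s = 1)$$

For ridge, $r = s$, whereas for lasso $r = 2 \neq 1 = s$, which is why the lasso falls in the category the abstract describes.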

preprint · 2020 · arXiv

Scalable Estimation of Epidemic Thresholds via Node Sampling

Infectious or contagious diseases can be transmitted from one person to another through social contact networks. In today's interconnected global society, such contagion processes can cause global public health hazards, as exemplified by the ongoing Covid-19 pandemic. It is therefore of great practical relevance to investigate the network transmission of contagious diseases from the perspective of statistical inference. An important and widely studied boundary condition for contagion processes over networks is the so-called epidemic threshold. The epidemic threshold plays a key role in determining whether a pathogen introduced into a social contact network will cause an epidemic or die out. In this paper, we investigate epidemic thresholds from the perspective of statistical network inference. We identify two major challenges that are caused by the high computational and sampling complexity of the epidemic threshold. We develop two statistically accurate and computationally efficient approximation techniques to address these issues under the Chung-Lu modeling framework. The second approximation, which is based on random walk sampling, further enjoys the advantage of requiring data on a vanishingly small fraction of nodes. We establish theoretical guarantees for both methods and demonstrate their empirical superiority.

preprint · 2020 · arXiv

Streaming Coresets for Symmetric Tensor Factorization

Factorizing tensors has recently become an important optimization module in a number of machine learning pipelines, especially in latent variable models. We show how to do this efficiently in the streaming setting. Given a set of $n$ vectors, each in $\mathbb{R}^d$, we present algorithms to select a sublinear number of these vectors as a coreset, while guaranteeing that the CP decomposition of the $p$-moment tensor of the coreset approximates the corresponding decomposition of the $p$-moment tensor computed from the full data. We introduce two novel algorithmic techniques: online filtering and kernelization. Using these two, we present six algorithms that achieve different tradeoffs of coreset size, update time and working space, beating or matching various state-of-the-art algorithms. In the case of matrices (order-$2$ tensors), our online row sampling algorithm guarantees $(1 \pm ε)$ relative error spectral approximation. We show applications of our algorithms in learning single topic models.

preprint · 2016 · arXiv

A Framework for Estimating Stream Expression Cardinalities

Given $m$ distributed data streams $A_1, \dots, A_m$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_1, \dots, A_m$. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.

preprint · 2016 · arXiv

An Improved Algorithm for Eye Corner Detection

In this paper, a modified algorithm for the detection of nasal and temporal eye corners is presented. The algorithm is a modification of the Santos and Proença method. In the first step, we detect the face and the eyes using classifiers based on Haar-like features. We then segment out the sclera from the detected eye region. From the segmented sclera, we segment out an approximate eyelid contour. Eye corner candidates are obtained using the Harris and Stephens corner detector. Finally, we introduce a post-pruning of the eye corner candidates to locate the eye corners. The algorithm has been tested on the Yale and JAFFE databases as well as on our own database.

preprint · 2016 · arXiv

Evaluation of Denoising Techniques for EOG signals based on SNR Estimation

This paper evaluates four algorithms for denoising raw Electrooculography (EOG) data based on the Signal to Noise Ratio (SNR). The SNR is computed using the eigenvalue method. The filtering algorithms are a) Finite Impulse Response (FIR) bandpass filters, b) Stationary Wavelet Transform, c) Empirical Mode Decomposition (EMD) and d) FIR Median Hybrid Filters. An EOG dataset has been prepared in which 20 subjects were asked to perform a letter cancellation test.

preprint · 2016 · arXiv

SPECFACE - A Dataset of Human Faces Wearing Spectacles

This paper presents a database of human faces for persons wearing spectacles. The database consists of images of faces having significant variations with respect to illumination, head pose, skin color, facial expressions and sizes, and nature of spectacles. The database contains data of 60 subjects. This database is expected to be a precious resource for the development and evaluation of algorithms for face detection, eye detection, head tracking, eye gaze tracking, etc., for subjects wearing spectacles. As such, this can be a valuable contribution to the computer vision community.

preprint · 2015 · arXiv

A Framework for Fast Face and Eye Detection

Face detection is an essential step in many computer vision applications such as surveillance, tracking, medical analysis and facial expression analysis. Several approaches to face detection have been proposed. Among them, the method based on Haar-like features is robust. Despite this robustness, Haar-like features have some limitations. However, with some simple modifications to the algorithm, it can be made faster and more robust. The present work increases the speed of the original algorithm by downsampling the frames and analyzes its performance with different scale factors. It also discusses the detection of tilted faces using an affine transformation of the input image.

preprint · 2015 · arXiv

A Video Database of Human Faces under Near Infra-Red Illumination for Human Computer Interaction Aplications

Human Computer Interaction (HCI) is an evolving area of research for coherent communication between computers and human beings. Some of the important applications of HCI reported in the literature are face detection, face pose estimation, face tracking and eye gaze estimation. Development of algorithms for these applications is an active field of research. However, the availability of standard databases for validating such algorithms is insufficient. This paper discusses the creation of such a database under Near Infra-Red (NIR) illumination. NIR illumination has gained popularity for night mode applications since prolonged exposure to Infra-Red (IR) lighting may lead to many health issues. The database contains NIR videos of 60 subjects in different head orientations and with different facial expressions, facial occlusions and illumination variation. This new database can be a very valuable resource for the development and evaluation of algorithms for face detection, eye detection, head tracking, eye gaze tracking etc. in NIR lighting.

preprint · 2015 · arXiv

A Vision Based System for Monitoring the Loss of Attention in Automotive Drivers

On-board monitoring of the alertness level of an automotive driver has been a challenging research problem in transportation safety and management. In this paper, we propose a robust real-time embedded platform to monitor the loss of attention of the driver during day as well as night driving conditions. The PERcentage of eye CLOSure (PERCLOS) has been used as the indicator of the alertness level. In this approach, the face is detected using Haar-like features and tracked using a Kalman Filter. The eyes are detected using Principal Component Analysis (PCA) during day time and the block Local Binary Pattern (LBP) features during night. Finally, the eye state is classified as open or closed using Support Vector Machines (SVM). In-plane and off-plane rotations of the driver's face have been compensated using Affine and Perspective Transformation respectively. Compensation for illumination variation is carried out using Bi Histogram Equalization (BHE). The algorithm has been cross-validated using brain signals and finally implemented on a Single Board Computer (SBC) with an Intel Atom processor, 1 GB RAM, 1.66 GHz clock, x86 architecture and the Windows Embedded XP operating system. The system is found to be robust under actual driving conditions.

preprint · 2015 · arXiv

Fast Computation of PERCLOS and Saccadic Ratio

This thesis describes the development of fast algorithms for the computation of PERcentage CLOSure of eyes (PERCLOS) and Saccadic Ratio (SR). PERCLOS and SR are two ocular parameters reported to be measures of alertness levels in human beings. PERCLOS is the percentage of time in which at least 80% of the eyelid remains closed over the pupil. Saccades are fast and simultaneous movements of both eyes in the same direction. SR is the ratio of peak saccadic velocity to the saccadic duration. This thesis addresses issues in image-based estimation of PERCLOS and SR that prevail in the literature, such as illumination variation, poor illumination conditions and head rotations. In this work, algorithms for real-time PERCLOS computation have been developed and implemented on an embedded platform. The platform has been used as a case study for assessment of loss of attention in automotive drivers. The SR estimation has been carried out offline, as real-time implementation requires high frame rates of processing, which are difficult to achieve due to hardware limitations. The accuracy in estimation of the loss of attention using PERCLOS and SR has been validated using brain signals, which are reported to be an authentic cue for estimating the state of alertness in human beings. The major contributions of this thesis include database creation and the design and implementation of fast algorithms for estimating PERCLOS and SR on embedded computing platforms.
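The two definitions in the abstract above (PERCLOS as the percentage of time with at least 80% eyelid closure, SR as peak saccadic velocity divided by saccadic duration) can be illustrated with a minimal sketch. It assumes per-frame eyelid closure fractions are already available and that frames are evenly spaced in time; the function names, the input format and the units are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def perclos(closure: Sequence[float], closed_threshold: float = 0.8) -> float:
    """Percentage of frames whose eyelid closure fraction is >= closed_threshold.

    `closure` holds one value per video frame in [0, 1] (assumed input format);
    with evenly spaced frames, the frame fraction equals the time fraction.
    """
    if not closure:
        raise ValueError("closure must contain at least one frame")
    closed_frames = sum(1 for c in closure if c >= closed_threshold)
    return 100.0 * closed_frames / len(closure)


def saccadic_ratio(peak_velocity: float, duration: float) -> float:
    """Ratio of peak saccadic velocity to saccadic duration (units are assumed)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return peak_velocity / duration


# Ten frames, four of which reach the 80% closure threshold.
frames = [0.1, 0.9, 0.85, 0.2, 0.95, 0.3, 0.1, 0.8, 0.0, 0.05]
print(perclos(frames))               # 40.0
print(saccadic_ratio(400.0, 50.0))   # 8.0 (e.g. deg/s per ms, an assumed unit)
```

The sketch covers only the final arithmetic; the image-based estimation of eyelid closure and saccade velocity that the thesis addresses is outside its scope.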

preprint · 2013 · arXiv

Crowdsourced Judgement Elicitation with Endogenous Proficiency

Crowdsourcing is now widely used to replace judgement by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content to evaluation of student assignments in massively open online courses via peer grading. A key issue in these settings, where direct monitoring is infeasible, is incentivizing agents in the 'crowd' to put in effort to make good evaluations, as well as to truthfully report their evaluations. This leads to a new family of information elicitation problems with unobservable ground truth, where an agent's proficiency, the probability with which she correctly evaluates the underlying ground truth, is endogenously determined by her strategic choice of how much effort to put into the task. Our main contribution is a simple, new mechanism for binary information elicitation for multiple tasks when agents have endogenous proficiencies, with the following properties: (i) Exerting maximum effort followed by truthful reporting of observations is a Nash equilibrium. (ii) This is the equilibrium with maximum payoff to all agents, even when agents have different maximum proficiencies, can use mixed strategies, and can choose a different strategy for each of their tasks. Our information elicitation mechanism requires only minimal bounds on the priors, asks agents to only report their own evaluations, and does not require any conditions on a diverging number of agent reports per task to achieve its incentive properties. The main idea behind our mechanism is to use the presence of multiple tasks and ratings to identify and penalize low-effort agreement: the mechanism rewards agents for agreeing with a 'reference' rater on a task but also penalizes blind agreement by subtracting out a statistic term designed so that agents obtain reward only when they put effort into their observations.
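The reward structure described at the end of the abstract above (agreement with a reference rater, minus a statistic that cancels blind agreement) can be sketched as follows. The abstract does not give the statistic, so the form used here, the agreement probability implied by each rater's average report on other tasks, is an assumption made for illustration, as are all function and parameter names.

```python
from typing import Sequence


def agreement(a: int, b: int) -> int:
    """1 if two binary reports match, else 0."""
    return a * b + (1 - a) * (1 - b)


def reward(
    report: int,
    ref_report: int,
    other_reports: Sequence[int],
    ref_other_reports: Sequence[int],
) -> float:
    """Agreement on the shared task minus an assumed blind-agreement statistic.

    other_reports / ref_other_reports are the two raters' binary reports on
    other tasks (assumed here to be tasks the two raters do not share).
    """
    mean_agent = sum(other_reports) / len(other_reports)
    mean_ref = sum(ref_other_reports) / len(ref_other_reports)
    # Agreement rate expected if both raters reported independently of the
    # task, at their observed frequencies of reporting 1 and 0.
    statistic = mean_agent * mean_ref + (1 - mean_agent) * (1 - mean_ref)
    return agreement(report, ref_report) - statistic


# Two raters who always report 1 agree on every task but earn nothing.
print(reward(1, 1, [1, 1, 1], [1, 1, 1]))  # 0.0
# Agreement between raters whose other reports are balanced earns 0.5.
print(reward(1, 1, [1, 0], [0, 1]))        # 0.5
```

In this sketch, agreement that could be produced without looking at the task is offset by the subtracted term, which matches the stated goal of rewarding agents only when they put effort into their observations.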

preprint · 2010 · arXiv

A Sparse Johnson--Lindenstrauss Transform

Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a sparse version of the fundamental tool in dimension reduction, the Johnson--Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just $\tilde{O}(\frac{1}{ε})$ non-zero entries per column. We also show a matching lower bound on the sparsity for a large class of projection matrices. Our bounds are somewhat surprising, given the known lower bounds of $Ω(\frac{1}{ε^2})$ both on the number of rows of any projection matrix and on the sparsity of projection matrices generated by natural constructions. Using this, we achieve an $\tilde{O}(\frac{1}{ε})$ update time per non-zero element for a $(1 \pm ε)$-approximate projection, thereby substantially outperforming the $\tilde{O}(\frac{1}{ε^2})$ update time required by prior approaches. A variant of our method offers the same guarantees for sparse vectors, yet its $\tilde{O}(d)$ worst case running time matches the best approach of Ailon and Liberty.

preprint · 2010 · arXiv

Feature Hashing for Large Scale Multitask Learning

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case: multitask learning with hundreds of thousands of tasks.
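A minimal sketch of feature hashing as referred to in the abstract above: each named feature is mapped to one of `dim` buckets by a hash function, with a second hash choosing a sign. The use of MD5, the salts, the `task::feature` naming scheme and the function names are illustrative assumptions, not details taken from the paper.

```python
import hashlib
from typing import Dict, List


def _hash(token: str, salt: str) -> int:
    """Deterministic 64-bit hash of a salted token (MD5 is an arbitrary choice)."""
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(features: Dict[str, float], dim: int) -> List[float]:
    """Map a sparse {name: value} feature dict to a dense vector of length dim."""
    vec = [0.0] * dim
    for name, value in features.items():
        index = _hash(name, "index:") % dim           # bucket for this feature
        sign = 1.0 if _hash(name, "sign:") % 2 == 0 else -1.0
        vec[index] += sign * value                    # collisions simply add up
    return vec


def hash_task_features(features: Dict[str, float], task: str, dim: int) -> List[float]:
    """Hash a shared copy and a task-specific copy of each feature into one space."""
    combined = dict(features)
    for name, value in features.items():
        combined[f"{task}::{name}"] = value           # hypothetical naming scheme
    return hash_features(combined, dim)


x = {"free": 1.0, "offer": 2.0, "meeting": 1.0}
print(hash_features(x, dim=16))
print(hash_task_features(x, task="user42", dim=16))
```

Because every task writes into the same `dim`-dimensional vector, memory use does not grow with the number of tasks; the cost is occasional collisions between features.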
