Source author record

Arin Chaudhuri

Arin Chaudhuri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Applications Artificial Intelligence Computer Vision Methodology Numerical Analysis

Catalog footprint

What is connected

4works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

The Trace Criterion for Kernel Bandwidth Selection for Support Vector Data Description

Support vector data description (SVDD) is a popular anomaly detection technique. The SVDD classifier partitions the whole data space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, for which the Gaussian kernel is a common choice. The Gaussian kernel has a bandwidth parameter, and it is important to set the value of this parameter correctly for good results. A small bandwidth leads to overfitting such that the resulting SVDD classifier overestimates the number of anomalies, whereas a large bandwidth leads to underfitting and an inability to detect many anomalies. In this paper, we present a new unsupervised method for selecting the Gaussian kernel bandwidth. Our method exploits a low-rank representation of the kernel matrix to suggest a kernel bandwidth value. Our new technique is competitive with the current state of the art for low-dimensional data and performs extremely well for many classes of high-dimensional data. Because the mathematical formulation of SVDD is identical with the mathematical formulation of one-class support vector machines (OCSVM) when the Gaussian kernel is used, our method is equally applicable to Gaussian kernel bandwidth tuning for OCSVM.

preprint2020arXiv

Visualizing the Finer Cluster Structure of Large-Scale and High-Dimensional Data

Dimension reduction and visualization of high-dimensional data have become very important research topics because of the rapid growth of large databases in data science. In this paper, we propose using a generalized sigmoid function to model the distance similarity in both high- and low-dimensional spaces. In particular, the parameter b is introduced to the generalized sigmoid function in low-dimensional space, so that we can adjust the heaviness of the function tail by changing the value of b. Using both simulated and real-world data sets, we show that our proposed method can generate visualization results comparable to those of uniform manifold approximation and projection (UMAP), which is a newly developed manifold learning technique with fast running speed, better global structure, and scalability to massive data sets. In addition, according to the purpose of the study and the data structure, we can decrease or increase the value of b to either reveal the finer cluster structure of the data or maintain the neighborhood continuity of the embedding for better visualization. Finally, we use domain knowledge to demonstrate that the finer subclusters revealed with small values of b are meaningful.

preprint2019arXiv

Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection

In recent years, there have been many practical applications of anomaly detection such as in predictive maintenance, detection of credit fraud, network intrusion, and system failure. The goal of anomaly detection is to identify in the test data anomalous behaviors that are either rare or unseen in the training data. This is a common goal in predictive maintenance, which aims to forecast the imminent faults of an appliance given abundant samples of normal behaviors. Local outlier factor (LOF) is one of the state-of-the-art models used for anomaly detection, but the predictive performance of LOF depends greatly on the selection of hyperparameters. In this paper, we propose a novel, heuristic methodology to tune the hyperparameters in LOF. A tuned LOF model that uses the proposed method shows good predictive performance in both simulations and real data sets.

preprint2016arXiv

Leveraging Unstructured Data to Detect Emerging Reliability Issues

Unstructured data refers to information that does not have a predefined data model or is not organized in a pre-defined manner. Loosely speaking, unstructured data refers to text data that is generated by humans. In after-sales service businesses, there are two main sources of unstructured data: customer complaints, which generally describe symptoms, and technician comments, which outline diagnostics and treatment information. A legitimate customer complaint can eventually be tracked to a failure or a claim. However, there is a delay between the time of a customer complaint and the time of a failure or a claim. A proactive strategy aimed at analyzing customer complaints for symptoms can help service providers detect reliability problems in advance and initiate corrective actions such as recalls. This paper introduces essential text mining concepts in the context of reliability analysis and a method to detect emerging reliability issues. The application of the method is illustrated using a case study.