Source author record

Changlin Wan

Changlin Wan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning; Computational Complexity; Computational Geometry; Distributed, Parallel, and Cluster Computing; Information Retrieval

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators


Preprint, 2021, arXiv

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How can training, and the scoring of newly arriving samples by outlyingness (referred to as prediction throughout the paper), be accelerated when a large number of unsupervised, heterogeneous OD models is involved? In this study, we propose a modular acceleration system, called SUOD, to address this question. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environments), while maintaining accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, along with a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm. We open-source SUOD for reproducibility and accessibility.
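The taskload imbalance aspect mentioned in the abstract can be illustrated with a small scheduling sketch. This is not SUOD's scheduler: it assumes that a cost estimate per model is already available and applies a generic greedy longest-processing-time-first heuristic. The function name `balance_tasks` and the cost values are hypothetical and chosen only for illustration.

```python
import heapq


def balance_tasks(costs, n_workers):
    """Greedy longest-processing-time-first assignment (illustrative only).

    costs[i] is an assumed, pre-estimated cost of training model i.
    Returns the task indices per worker and the total load per worker.
    """
    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    # Place the most expensive tasks first so that cheap ones fill the gaps.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for task in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + costs[task], w))

    loads = [sum(costs[t] for t in tasks) for tasks in assignment]
    return assignment, loads


# Hypothetical costs for six heterogeneous models, two costly and four cheap.
costs = [9.0, 8.0, 1.0, 1.0, 2.0, 1.0]

# Splitting the list in half by position leaves one worker with most of the work.
naive_loads = [sum(costs[:3]), sum(costs[3:])]

assignment, loads = balance_tasks(costs, n_workers=2)
print("naive split loads:", naive_loads)
print("balanced loads:   ", loads)
```

With these assumed costs, the positional split gives one worker 18 units of work and the other 4, while the greedy assignment gives each worker 11. In practice the costs would have to be predicted before training, which is a separate problem that this sketch does not address.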

Preprint, 2020, arXiv

A Proof for P =? NP Problem

The $\textbf{P}$ vs. $\textbf{NP}$ problem is an important problem in contemporary mathematics and theoretical computer science. Many proofs have been proposed for this problem. This paper proposes a theoretical proof for the $\textbf{P}$ vs. $\textbf{NP}$ problem. The central idea of this proof is a recursive definition of a Turing machine (TM for short) that accepts the encoding strings of valid TMs. By this definition, an infinite sequence of TMs is constructed, and it is proven that the sequence includes all valid TMs. Based on these TMs, the class $\textbf{D}$, which includes all decidable languages, and the union and reduction operators are defined. By constructing a language $\textbf{Up}$ of the union of $\textbf{D}$, it is proved that $\textbf{P}=\textbf{Up}$ and $\textbf{Up}=\textbf{NP}$, and the result $\textbf{P}=\textbf{NP}$ is proven.

Preprint, 2020, arXiv

Denoising individual bias for a fairer binary submatrix detection

Low-rank representation of a binary matrix is powerful in disentangling sparse individual-attribute associations and has been widely applied. Existing binary matrix factorization (BMF) or co-clustering (CC) methods often assume i.i.d. background noise. However, this assumption is easily violated in real data, where heterogeneous row- or column-wise probabilities of binary entries result in disparate element-wise background distributions and undermine the rationale of existing methods. We propose a binary data denoising framework, namely BIND, which optimizes the detection of true patterns by estimating the row- or column-wise mixture distribution of patterns and disparate background, and eliminating the binary attributes that are more likely to come from the background. BIND is supported by thoroughly derived mathematical properties of the row- and column-wise mixture distributions. Our experiments on synthetic and real-world data demonstrate that BIND effectively removes background noise and drastically increases the fairness and accuracy of state-of-the-art BMF and CC methods.

Preprint, 2020, arXiv

Fast And Efficient Boolean Matrix Factorization By Geometric Segmentation

Boolean matrices have been used to represent digital information in many fields, including bank transactions, crime records, natural language processing, and protein-protein interaction. Boolean matrix factorization (BMF) aims to approximate a binary matrix as the Boolean product of two low-rank Boolean matrices, which can reveal a vast amount of information about the patterns of relationships between features and samples. Inspired by binary matrix permutation theories and geometric segmentation, we developed a fast and efficient BMF approach called MEBF (Median Expansion for Boolean Factorization). Overall, MEBF adopts a heuristic approach to locate binary patterns presented as submatrices that are dense in 1's. At each iteration, MEBF permutes the rows and columns such that the permuted matrix is approximately Upper Triangular-Like (UTL) with the so-called Simultaneous Consecutive-ones Property (SC1P). The largest submatrix dense in 1's lies in the upper triangular area of the permuted matrix, and its location is determined by a geometric segmentation of a triangle. We compared MEBF with other state-of-the-art approaches on data scenarios with different sparsity and noise levels. MEBF demonstrated superior performance, with lower reconstruction error, higher computational efficiency, and more accurate sparse patterns than popular methods such as ASSO, PANDA and MP. We demonstrated the application of MEBF on both binary and non-binary data sets, and revealed its further potential in knowledge retrieval and data denoising.
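The BMF objective described in the abstract can be made concrete with a minimal sketch of the Boolean product and the reconstruction error. This does not implement MEBF or its permutation and segmentation steps; the helper names `boolean_product` and `reconstruction_error` and the example matrices are assumptions made for illustration.

```python
def boolean_product(a, b):
    """Boolean matrix product: out[i][j] = OR over k of (a[i][k] AND b[k][j])."""
    inner = len(b)
    n_cols = len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(n_cols)]
        for row in a
    ]


def reconstruction_error(x, approx):
    """Number of entries where the approximation differs from the input."""
    return sum(xi != ai for rx, ra in zip(x, approx) for xi, ai in zip(rx, ra))


# A 3 x 4 binary matrix made of two dense blocks that share the middle row.
x = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical rank-2 factors: column k of `a` marks the rows of pattern k,
# and row k of `b` marks the columns of pattern k.
a = [
    [1, 0],
    [1, 1],
    [0, 1],
]
b = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

approx = boolean_product(a, b)
print(approx)
print("mismatched entries:", reconstruction_error(x, approx))
```

Here the two factors reproduce `x` exactly. The middle row belongs to both patterns, and because the Boolean product uses OR rather than addition, its entries stay in {0, 1}.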
