Researcher profile

Chihao Zhang

Chihao Zhang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2022arXiv

A Perfect Sampler for Hypergraph Independent Sets

The problem of uniformly sampling hypergraph independent sets is revisited. We design an efficient perfect sampler for the problem under a condition similar to that of the asymmetric Lovász Local Lemma. When applied to $d$-regular $k$-uniform hypergraphs on $n$ vertices, our sampler terminates in expected $O(n\log n)$ time provided $d\le c\cdot 2^{k/2}/k$ for some constant $c>0$. If in addition the hypergraph is linear, the condition can be weaken to $d\le c\cdot 2^{k}/k^2$ for some constant $c>0$, matching the rapid mixing condition for Glauber dynamics in Hermon, Sly and Zhang [HSZ19].

preprint2022arXiv

Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification

Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and search strategies.

preprint2021arXiv

Matrix Normal PCA for Interpretable Dimension Reduction and Graphical Noise Modeling

Principal component analysis (PCA) is one of the most widely used dimension reduction and multivariate statistical techniques. From a probabilistic perspective, PCA seeks a low-dimensional representation of data in the presence of independent identical Gaussian noise. Probabilistic PCA (PPCA) and its variants have been extensively studied for decades. Most of them assume the underlying noise follows a certain independent identical distribution. However, the noise in the real world is usually complicated and structured. To address this challenge, some variants of PCA for data with non-IID noise have been proposed. However, most of the existing methods only assume that the noise is correlated in the feature space while there may exist two-way structured noise. To this end, we propose a powerful and intuitive PCA method (MN-PCA) through modeling the graphical noise by the matrix normal distribution, which enables us to explore the structure of noise in both the feature space and the sample space. MN-PCA obtains a low-rank representation of data and the structure of noise simultaneously. And it can be explained as approximating data over the generalized Mahalanobis distance. We develop two algorithms to solve this model: one maximizes the regularized likelihood, the other exploits the Wasserstein distance, which is more robust. Extensive experiments on various data demonstrate their effectiveness.

preprint2020arXiv

Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering

Matrix decomposition is one of the fundamental tools to discover knowledge from big data generated by modern applications. However, it is still inefficient or infeasible to process very big data using such a method in a single machine. Moreover, big data are often distributedly collected and stored on different machines. Thus, such data generally bear strong heterogeneous noise. It is essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale up well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing including 1) the accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) the statistical inference. We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Synthetic experiments validate our theoretical results, and real-world experiments show that our algorithms scale up well to big data and achieves superior or competing performance compared to other distributed methods.

preprint2020arXiv

Rapid mixing from spectral independence beyond the Boolean domain

We extend the notion of spectral independence (introduced by Anari, Liu, and Oveis Gharan [ALO20]) from the Boolean domain to general discrete domains. This property characterises distributions with limited correlations, and implies that the corresponding Glauber dynamics is rapidly mixing. As a concrete application, we show that Glauber dynamics for sampling proper $q$-colourings mixes in polynomial-time for the family of triangle-free graphs with maximum degree $Δ$ provided $q\ge (α^*+δ)Δ$ where $α^*\approx 1.763$ is the unique solution to $α^*=\exp(1/α^*)$ and $δ>0$ is any constant. This is the first efficient algorithm for sampling proper $q$-colourings in this regime with possibly unbounded $Δ$. Our main tool of establishing spectral independence is the recursive coupling by Goldberg, Martin, and Paterson [GMP05].