Source author record

Lingxiao Huang

Lingxiao Huang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Artificial Intelligence Machine Learning Computational Geometry cs.CY Computation and Language Computer Science and Game Theory Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

11works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Clustering Aware Classification for Risk Prediction and Subtyping in Clinical Data

In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either 1) are classifier-specific and not generic, or 2) independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature, despite its importance in several real-world applications. In this paper, first, we theoretically analyze the generalization performance of classifiers trained on clustered data and find conditions under which clustering can potentially aid classification. This motivates the design of a simple k-means-based classification algorithm called Clustering Aware Classification (CAC) and its neural variant {DeepCAC}. DeepCAC effectively leverages deep representation learning to learn latent embeddings and finds clusters in a manner that make the clustered data suitable for training classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of DeepCAC over previous methods for combined clustering and classification.

preprint2022arXiv

A Roadmap for Big Model

With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the BM application in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts, they are Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. In each topic, we summarize clearly the current studies and propose some future research directions. At the end of this paper, we conclude the further development of BMs in a more general view.

preprint2021arXiv

Fair Classification with Noisy Protected Attributes: A Framework with Provable Guarantees

We present an optimization framework for learning a fair classifier in the presence of noisy perturbations in the protected attributes. Compared to prior work, our framework can be employed with a very general class of linear and linear-fractional fairness constraints, can handle multiple, non-binary protected attributes, and outputs a classifier that comes with provable guarantees on both accuracy and fairness. Empirically, we show that our framework can be used to attain either statistical rate or false positive rate fairness guarantees with a minimal loss in accuracy, even when the noise is large, in two real-world datasets.

preprint2020arXiv

Approximation Algorithms for the Connected Sensor Cover Problem

We study the minimum connected sensor cover problem (MIN-CSC) and the budgeted connected sensor cover (Budgeted-CSC) problem, both motivated by important applications (e.g., reduce the communication cost among sensors) in wireless sensor networks. In both problems, we are given a set of sensors and a set of target points in the Euclidean plane. In MIN-CSC, our goal is to find a set of sensors of minimum cardinality, such that all target points are covered, and all sensors can communicate with each other (i.e., the communication graph is connected). We obtain a constant factor approximation algorithm, assuming that the ratio between the sensor radius and communication radius is bounded. In Budgeted-CSC problem, our goal is to choose a set of $B$ sensors, such that the number of targets covered by the chosen sensors is maximized and the communication graph is connected. We also obtain a constant approximation under the same assumption.

preprint2020arXiv

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Developing classification algorithms that are fair with respect to sensitive attributes of the data has become an important problem due to the growing deployment of classification algorithms in various social contexts. Several recent works have focused on fairness with respect to a specific metric, modeled the corresponding fair classification problem as a constrained optimization problem, and developed tailored algorithms to solve them. Despite this, there still remain important metrics for which we do not have fair classifiers and many of the aforementioned algorithms do not come with theoretical guarantees; perhaps because the resulting optimization problem is non-convex. The main contribution of this paper is a new meta-algorithm for classification that takes as input a large class of fairness constraints, with respect to multiple non-disjoint sensitive attributes, and which comes with provable guarantees. This is achieved by first developing a meta-algorithm for a large family of classification problems with convex constraints, and then showing that classification problems with general types of fairness constraints can be reduced to those in this family. We present empirical results that show that our algorithm can achieve near-perfect fairness with respect to various fairness metrics, and that the loss in accuracy due to the imposed fairness constraints is often small. Overall, this work unifies several prior works on fair classification, presents a practical algorithm with theoretical guarantees, and can handle fairness metrics that were previously not possible.

preprint2020arXiv

Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distance of each point to the closest center. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves a $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$ which, when compared to the result in [Sohler and Woodruff, STOC 2018], saves a $k$ factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems: subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $Ω\left(k\cdot \min \left\{2^{z/20},d \right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering, which has a linear dependence of size on $k$ and an exponential dependence on $z$ that matches our algorithmic results.

preprint2020arXiv

Stable and Fair Classification

Fair classification has been a topic of intense study in machine learning, and several algorithms have been proposed towards this important task. However, in a recent study, Friedler et al. observed that fair classification algorithms may not be stable with respect to variations in the training dataset -- a crucial consideration in several real-world applications. Motivated by their work, we study the problem of designing classification algorithms that are both fair and stable. We propose an extended framework based on fair classification algorithms that are formulated as optimization problems, by introducing a stability-focused regularization term. Theoretically, we prove a stability guarantee, that was lacking in fair classification algorithms, and also provide an accuracy guarantee for our extended framework. Our accuracy guarantee can be used to inform the selection of the regularization parameter in our framework. To the best of our knowledge, this is the first work that combines stability and fairness in automated decision-making tasks. We assess the benefits of our approach empirically by extending several fair classification algorithms that are shown to achieve the best balance between fairness and accuracy over the Adult dataset. Our empirical results show that our framework indeed improves the stability at only a slight sacrifice in accuracy.

preprint2016arXiv

$ε$-Kernel Coresets for Stochastic Points

With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing $ε$-kernel coresets for uncertain points. We consider uncertainty in the existential model where each point's location is fixed but only occurs with a certain probability, and the locational model where each point has a probability distribution describing its location. An $ε$-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an \expkernel), as well as the probability distribution on the width (an \probkernel) for any direction. We show that there exists a set of $O(1/ε^{(d-1)/2})$ deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower bounded by a constant, an exp-kernel\ (or an fpow-kernel) is still possible. We also construct an quant-kernel coreset in linear time. Finally, combining with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points and some shape fitting problems under uncertainty.

preprint2016arXiv

Stochastic $k$-Center and $j$-Flat-Center Problems

Solving geometric optimization problems over uncertain data have become increasingly important in many applications and have attracted a lot of attentions in recent years. In this paper, we study two important geometric optimization problems, the $k$-center problem and the $j$-flat-center problem, over stochastic/uncertain data points in Euclidean spaces. For the stochastic $k$-center problem, we would like to find $k$ points in a fixed dimensional Euclidean space, such that the expected value of the $k$-center objective is minimized. For the stochastic $j$-flat-center problem, we seek a $j$-flat (i.e., a $j$-dimensional affine subspace) such that the expected value of the maximum distance from any point to the $j$-flat is minimized. We consider both problems under two popular stochastic geometric models, the existential uncertainty model, where the existence of each point may be uncertain, and the locational uncertainty model, where the location of each point may be uncertain. We provide the first PTAS (Polynomial Time Approximation Scheme) for both problems under the two models. Our results generalize the previous results for stochastic minimum enclosing ball and stochastic enclosing cylinder.

preprint2015arXiv

Canonical Paths for MCMC: from Art to Science

Markov Chain Monte Carlo (MCMC) method is a widely used algorithm design scheme with many applications. To make efficient use of this method, the key step is to prove that the Markov chain is rapid mixing. Canonical paths is one of the two main tools to prove rapid mixing. However, there are much fewer success examples comparing to coupling, the other main tool. The main reason is that there is no systematic approach or general recipe to design canonical paths. Building up on a previous exploration by McQuillan, we develop a general theory to design canonical paths for MCMC: We reduce the task of designing canonical paths to solving a set of linear equations, which can be automatically done even by a machine. Making use of this general approach, we obtain fully polynomial-time randomized approximation schemes (FPRAS) for counting the number of $b$-matching with $b\leq 7$ and $b$-edge-cover with $b\leq 2$. They are natural generalizations of matchings and edge covers for graphs. No polynomial time approximation was previously known for these problems.

preprint2014arXiv

The Multi-shop Ski Rental Problem

We consider the {\em multi-shop ski rental} problem. This problem generalizes the classic ski rental problem to a multi-shop setting, in which each shop has different prices for renting and purchasing a pair of skis, and a \emph{consumer} has to make decisions on when and where to buy. We are interested in the {\em optimal online (competitive-ratio minimizing) mixed strategy} from the consumer's perspective. For our problem in its basic form, we obtain exciting closed-form solutions and a linear time algorithm for computing them. We further demonstrate the generality of our approach by investigating three extensions of our basic problem, namely ones that consider costs incurred by entering a shop or switching to another shop. Our solutions to these problems suggest that the consumer must assign positive probability in \emph{exactly one} shop at any buying time. Our results apply to many real-world applications, ranging from cost management in \texttt{IaaS} cloud to scheduling in distributed computing.

Lingxiao Huang

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Clustering Aware Classification for Risk Prediction and Subtyping in Clinical Data

A Roadmap for Big Model

Fair Classification with Noisy Protected Attributes: A Framework with Provable Guarantees

Approximation Algorithms for the Connected Sensor Cover Problem

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

Stable and Fair Classification

$ε$-Kernel Coresets for Stochastic Points

Stochastic $k$-Center and $j$-Flat-Center Problems

Canonical Paths for MCMC: from Art to Science

The Multi-shop Ski Rental Problem