Source author record

Yuan-chin Ivan Chang

Yuan-chin Ivan Chang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Methodology, Applications, Machine Learning, Computation

Catalog footprint

What is connected

8 works

4 topics

4 close collaborators

Preprint, 2026, arXiv

Efficient Data Reduction Via PCA-Guided Quantile Based Sampling

In large-scale statistical modeling, reducing data size through subsampling is essential for balancing computational efficiency and statistical accuracy. We propose a new method, Principal Component Analysis guided Quantile Sampling (PCA-QS), which projects data onto principal components and applies quantile-based sampling to retain representative and diverse subsets. Compared with uniform random sampling, leverage score sampling, and coreset methods, PCA-QS consistently achieves lower mean squared error and better preservation of key data characteristics, while also being computationally efficient. This approach is adaptable to a variety of data scenarios and shows strong potential for broad applications in statistical computing.
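The idea in the abstract above (project onto principal components, then sample by quantile) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `first_pc` and `pca_quantile_sample`, the use of only the first component, the power-iteration approximation, and the equal allocation per bin are all assumptions made for this example.

```python
import random

def first_pc(rows, iters=200):
    """Approximate the leading principal direction by power iteration.

    Assumption: the all-ones starting vector is not orthogonal to the
    leading direction (true for typical data, not guaranteed in general).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) * v
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Scores are the projections of the centered rows onto the direction v
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

def pca_quantile_sample(rows, n_bins, per_bin, seed=0):
    """Return sorted row indices: up to `per_bin` rows from each score quantile bin."""
    rng = random.Random(seed)
    _, scores = first_pc(rows)
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    n = len(order)
    chosen = []
    for b in range(n_bins):
        # Equal-count bins over the sorted scores act as empirical quantile bins
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        bin_idx = order[lo:hi]
        chosen.extend(rng.sample(bin_idx, min(per_bin, len(bin_idx))))
    return sorted(chosen)
```

The function returns indices into the original rows, so the selected subset stays in the original feature space; the principal component is used only to decide which rows are drawn.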

Preprint, 2026, arXiv

Integrating Multi-Armed Bandit, Active Learning, and Distributed Computing for Scalable Optimization

Modern optimization problems in scientific and engineering domains often rely on expensive black-box evaluations, such as those arising in physical simulations or deep learning pipelines, where gradient information is unavailable or unreliable. In these settings, conventional optimization methods quickly become impractical due to prohibitive computational costs and poor scalability. We propose ALMAB-DC, a unified and modular framework for scalable black-box optimization that integrates active learning, multi-armed bandits, and distributed computing, with optional GPU acceleration. The framework leverages surrogate modeling and information-theoretic acquisition functions to guide informative sample selection, while bandit-based controllers dynamically allocate computational resources across candidate evaluations in a statistically principled manner. These decisions are executed asynchronously within a distributed multi-agent system, enabling high-throughput parallel evaluation. We establish theoretical regret bounds for both UCB-based and Thompson-sampling-based variants and develop a scalability analysis grounded in Amdahl's and Gustafson's laws. Empirical results across synthetic benchmarks, reinforcement learning tasks, and scientific simulation problems demonstrate that ALMAB-DC consistently outperforms state-of-the-art black-box optimizers. By design, ALMAB-DC is modular, uncertainty-aware, and extensible, making it particularly well suited for high-dimensional, resource-intensive optimization challenges.
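To make the "UCB-based" bandit controller mentioned above more concrete, here is a minimal sketch of the standard UCB1 rule for choosing which candidate to evaluate next. It is not ALMAB-DC itself: the function name `ucb1`, the `pull` callback, and the assumption that rewards lie in [0, 1] are illustrative choices, and the surrogate modeling, asynchronous execution, and distributed components described in the abstract are omitted.

```python
import math

def ucb1(pull, n_arms, rounds):
    """Run the classic UCB1 rule; `pull(arm)` returns a reward in [0, 1].

    Returns the number of times each arm was played.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            # Empirical mean plus an exploration bonus that shrinks with plays
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

In a usage such as `ucb1(lambda a: [0.2, 0.8, 0.5][a], 3, 500)`, most of the evaluation budget ends up on the arm with the highest reward, while the weaker arms are still revisited occasionally.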

Preprint, 2026, arXiv

PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling

We introduce Principal Component Analysis guided Quantile Sampling (PCA-QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA-QS retains the original feature space while using leading principal components solely to guide a quantile-based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback-Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA-QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets show that PCA-QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA-QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.

Preprint, 2022, arXiv

Determination of class-specific variables in nonparametric multiple-class classification

As technology has advanced, collecting data via automatic collection devices has become popular, so we commonly face data sets with a large number of variables, especially when these data sets are collected without specific research goals beforehand. It has been pointed out in the literature that the difficulty of high-dimensional classification problems is intrinsically caused by too many noise variables that are useless for reducing classification error, offer little benefit for decision-making, and increase complexity and confusion in model interpretation. A good variable selection strategy is therefore a must for using such data well, especially when we expect to use the results in succeeding applications or studies where model interpretability is essential. Thus, the conventional classification measures, such as accuracy, sensitivity, and precision, cannot be the only performance criteria. In this paper, we propose a probability-based nonparametric multiple-class classification method and integrate it with the ability to identify high-impact variables for each individual class, so that we obtain more information about the classification rule and the character of each class. The proposed method can have prediction power approximately equal to that of the Bayes rule while still retaining the ability of "model-interpretation." We report the asymptotic properties of the proposed method and use both synthesized and real data sets to illustrate its properties under different classification situations. We also separately discuss variable identification and training sample size determination, and summarize those procedures as algorithms so that users can easily implement them in different computing languages.

Preprint, 2014, arXiv

Active Learning Via Sequential Design and Uncertainty Sampling

Classification is an important task in many fields, including biomedical research and machine learning. Traditionally, a classification rule is constructed based on a set of labeled data. Recently, due to technological innovation and automatic data collection schemes, we easily encounter data sets containing large amounts of unlabeled samples. Because labeling each of them is usually costly and inefficient, how to utilize these unlabeled data in the classifier construction process becomes an important problem. In the machine learning literature, active learning and semi-supervised learning are popular concepts discussed in this situation, where classification algorithms recruit new unlabeled subjects sequentially based on the information learned from previous stages of the learning process, and these new subjects are then labeled and included as new training samples. From a statistical perspective, these methods can be recognized as a hybrid of sequential design and stochastic approximation procedures. In this paper, we study sequential learning procedures for building efficient and effective classifiers, where only the selected subjects are labeled and included in the learning stage. The proposed algorithm combines the ideas of Bayesian sequential optimal design and uncertainty sampling. Computational issues of the algorithm are discussed. Numerical results using both synthesized data and real examples are reported.
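The uncertainty sampling step mentioned in this abstract can be illustrated with a short sketch for the binary case: among the unlabeled pool, pick the subjects whose predicted class probability is closest to 0.5. This shows only the generic uncertainty sampling rule, not the paper's combined procedure with Bayesian sequential optimal design; the function name `most_uncertain` and its parameters are hypothetical.

```python
def most_uncertain(probs, k=1):
    """Return indices of the k pool items whose predicted P(y = 1) is nearest 0.5.

    `probs` holds the current classifier's predicted probabilities for the
    unlabeled pool; ties keep their original pool order (sorted is stable).
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]
```

In an active learning loop, the returned subjects would be labeled, added to the training set, and the classifier refitted before the next selection.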

Preprint, 2013, arXiv

Sequential Estimation in Item Calibration with A Two-Stage Design

In this paper, we apply a two-stage sequential design to item calibration problems under a three-parameter logistic model assumption. The measurement errors of the estimates of the examinees' latent trait levels are considered in our procedure. Moreover, a sequential procedure is employed to guarantee that the parameter estimates reach a prescribed accuracy criterion when the iteration is stopped, which takes full advantage of the sequential design. Statistical properties of both the item parameter estimates and the sequential procedure are discussed. We compare the performance of the proposed method with that of procedures based on some conventional designs using numerical studies.
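For readers unfamiliar with the three-parameter logistic (3PL) model named in this abstract, the sketch below evaluates its standard item response function, P(theta) = c + (1 - c) / (1 + exp(-a (theta - b))). The parameter names `a` (discrimination), `b` (difficulty), and `c` (guessing) follow common textbook notation and are an assumption here; the abstract does not give the paper's own parameterization.

```python
import math

def three_pl(theta, a, b, c):
    """Probability of a correct response under the standard 3PL model.

    theta: examinee latent trait level
    a: item discrimination, b: item difficulty, c: guessing parameter in [0, 1)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

When `theta` equals `b`, the probability is halfway between the guessing floor `c` and 1, which is one way to read the difficulty parameter.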

Preprint, 2011, arXiv

Evaluating the diagnostic powers of variables and their linear combinations when the gold standard is continuous

The receiver operating characteristic (ROC) curve is a very useful tool for analyzing the diagnostic/classification power of instruments/classification schemes as long as a binary-scale gold standard is available. When the gold standard is continuous and there is no confirmative threshold, the ROC curve becomes less useful. Hence, several extensions have been proposed for evaluating the diagnostic potential of variables of interest. However, due to the computational difficulties of these nonparametric-based extensions, they are not easy to use for finding the optimal combination of variables to improve on the individual diagnostic power. Therefore, we propose a new measure, which extends the AUC index, for identifying variables with good potential to be used in a diagnostic scheme. In addition, we propose a threshold gradient descent based algorithm for finding the best linear combination of variables that maximizes this new measure, which is applicable even when the number of variables is huge. The estimate of the proposed index and its asymptotic property are studied. The performance of the proposed method is illustrated using both synthesized and real data sets.
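As background for the AUC index that this paper extends, the sketch below computes the ordinary AUC for a binary gold standard as the fraction of (positive, negative) pairs that the score orders correctly, counting ties as one half. It illustrates only the classical binary case, not the proposed measure for a continuous gold standard; the function name `auc` is chosen for this example.

```python
def auc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the AUC for 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive case scored above negative case
            elif p == q:
                wins += 0.5   # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))
```

The pairwise form makes the dependence on a binary split explicit: without a threshold on the gold standard there are no "positive" and "negative" groups to compare, which is the gap the abstract describes.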

Preprint, 2011, arXiv

Sequential estimation for covariate-adjusted response-adaptive designs

In clinical trials, a covariate-adjusted response-adaptive (CARA) design allows a subject newly entering a trial a better chance of being allocated to a superior treatment regimen based on cumulative information from previous subjects, and adjusts the allocation according to individual covariate information. Since this design allocates subjects sequentially, it is natural to apply a sequential method for estimating the treatment effect in order to make the data analysis more efficient. In this paper, we study the sequential estimation of treatment effect for a general CARA design. A stopping criterion is proposed such that the estimates satisfy a prescribed precision when the sampling is stopped. The properties of estimates and stopping time are obtained under the proposed stopping rule. In addition, we show that the asymptotic properties of the allocation function, under the proposed stopping rule, are the same as those obtained in the non-sequential/fixed sample size counterpart. We then illustrate the performance of the proposed procedure with some simulation results using logistic models. The properties, such as the coverage probability of treatment effect, correct allocation proportion and average sample size, for diverse combinations of initial sample sizes and tuning parameters in the utility function are discussed.
