Source author record

Yili Hong

Yili Hong appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Applications; Computation; Methodology; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Human-Computer Interaction

Catalog footprint

What is connected

12 works

6 topics

4 close collaborators

preprint, 2024, arXiv

Multivariate Functional Clustering with Variable Selection and Application to Sensor Data from Engineering Systems

Multi-sensor data that track system operating behaviors are now widely available from various engineering systems. Measurements from each sensor over time form a curve and can be viewed as functional data. Clustering of these multivariate functional curves is important for studying the operating patterns of systems. One complication in such applications is the possible presence of sensors whose data do not contain relevant information. Hence it is desirable for the clustering method to be equipped with an automatic sensor selection procedure. Motivated by a real engineering application, we propose a functional data clustering method that simultaneously removes noninformative sensors and groups functional curves into clusters using informative sensors. Functional principal component analysis is used to transform multivariate functional data into a coefficient matrix for data reduction. We then model the transformed data by a Gaussian mixture distribution to perform model-based clustering with variable selection. Three types of penalties, the individual, variable, and group penalties, are considered to achieve automatic variable selection. Extensive simulations are conducted to assess the clustering and variable selection performance of the proposed methods. The application of the proposed methods to an engineering system with multiple sensors shows the promise of the methods and reveals interesting patterns in the sensor data.

preprint, 2022, arXiv

A Little Too Personal: Effects of Standardization versus Personalization on Job Acquisition, Work Completion, and Revenue for Online Freelancers

As more individuals consider permanently working from home, the online labor market continues to grow as an alternative working environment. While the flexibility and autonomy of these online gigs attract many workers, success depends critically upon self-management and workers' efficient allocation of scarce resources. To achieve this, freelancers may develop alternative work strategies, employing highly standardized schedules and communication patterns while taking on large work volumes, or engaging in smaller numbers of jobs while tailoring their activities to build relationships with individual employers. In this study, we consider this contrast in relation to worker communication patterns. We demonstrate the heterogeneous effects of standardization versus personalization across different stages of a project and examine the relative impact on job acquisition, project completion, and earnings. Our findings can inform the design of platforms and various worker support tools for the gig economy.

preprint, 2022, arXiv

Design Strategies and Approximation Methods for High-Performance Computing Variability Management

Performance variability management is an active research area in high-performance computing (HPC). We focus on input/output (I/O) variability. To study the performance variability, computer scientists often use grid-based designs (GBDs) to collect I/O variability data, and use mathematical approximation methods to build a prediction model. Mathematical approximation models can be biased, particularly if extrapolation is needed. Space-filling designs (SFDs) and surrogate models such as the Gaussian process (GP) are popular for data collection and building predictive models. The applicability of SFDs and surrogates to HPC variability needs investigation. We investigate their applicability in the HPC setting in terms of design efficiency, prediction accuracy, and scalability. We first customize the existing SFDs so that they can be applied in the HPC setting. We conduct a comprehensive investigation of design strategies and the prediction ability of approximation methods. We use both synthetic data simulated from three test functions and real data from the HPC setting. We then compare different methods in terms of design efficiency, prediction accuracy, and scalability. In synthetic and real data analysis, GP with SFDs performs best in most scenarios. With respect to approximation models, GP is recommended if the data are collected by SFDs. If data are collected using GBDs, both GP and Delaunay can be considered. With the best choice of approximation method, the relative performance of SFDs and GBDs depends on the properties of the underlying surface. For the cases in which SFDs perform better, the number of design points needed for SFDs is about half or less than that of the GBD to achieve the same prediction accuracy. SFDs that can be tailored to high dimensions and non-smooth surfaces are recommended, especially when large numbers of input factors need to be considered in the model.
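The contrast between grid-based and space-filling designs in the abstract above can be illustrated with a small sketch. It uses a plain Latin hypercube sample as a stand-in for an SFD; the abstract does not say which SFDs the authors customized, so the function names and the choice of Latin hypercube are assumptions made for illustration only.

```python
import random


def latin_hypercube(n_points, n_dims, seed=0):
    """Return n_points in [0, 1)^n_dims with exactly one point per
    equal-width stratum in every dimension (a basic space-filling design)."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each stratum, then shuffle the strata order.
        col = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(col)
        columns.append(col)
    # Transpose the per-dimension columns into design points.
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def full_grid(levels_per_dim, n_dims):
    """Return a full factorial grid on [0, 1]^n_dims (a grid-based design)."""
    axis = [i / (levels_per_dim - 1) for i in range(levels_per_dim)]
    points = [()]
    for _ in range(n_dims):
        points = [p + (a,) for p in points for a in axis]
    return points


# Nine runs in two factors under each strategy (hypothetical sizes).
design = latin_hypercube(n_points=9, n_dims=2)
grid = full_grid(levels_per_dim=3, n_dims=2)
```

With the same budget of nine runs, the grid observes only three distinct levels of each factor, while the Latin hypercube observes nine. That is one intuition for why a space-filling design may need fewer points for a given accuracy, although the abstract notes the outcome depends on the underlying surface.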

preprint, 2022, arXiv

Prediction for Distributional Outcomes in High-Performance Computing I/O Variability

Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge and has become a critical research topic in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management and is nontrivial because one needs to predict a distribution function based on system factors. In this paper, we propose a new framework to predict performance distributions. The proposed model is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We evaluate the performance of the proposed method by using the IOzone variability data based on various prediction tasks. Results show that the proposed method can generate accurate predictions and outperforms existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our methods can be further used as a surrogate model for HPC system variability monitoring and optimization.
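As a rough illustration of the last step described above, the sketch below turns a predicted distribution function evaluated on a throughput grid into scalar summaries. It is not the authors' model: the monotone repair here is a simple running maximum applied after the fact, whereas the abstract describes a constraint built into the Gaussian process, and the grid and predicted values are hypothetical.

```python
import math


def monotone_cdf(values):
    """Clip values to [0, 1] and take a running maximum so the curve is nondecreasing."""
    out, running = [], 0.0
    for v in values:
        running = max(running, min(1.0, max(0.0, v)))
        out.append(running)
    return out


def summaries(grid, cdf):
    """Mean and standard deviation, treating CDF increments as mass at the grid points."""
    masses = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]
    total = sum(masses)
    mean = sum(x * m for x, m in zip(grid, masses)) / total
    var = sum((x - mean) ** 2 * m for x, m in zip(grid, masses)) / total
    return mean, math.sqrt(var)


def quantile(grid, cdf, p):
    """First point where the CDF reaches p, interpolating linearly between grid points."""
    for k in range(len(grid)):
        if cdf[k] >= p:
            if k == 0:
                return grid[0]
            w = (p - cdf[k - 1]) / (cdf[k] - cdf[k - 1])
            return grid[k - 1] + w * (grid[k] - grid[k - 1])
    return grid[-1]


# Hypothetical throughput grid and a raw prediction with one monotonicity violation.
throughput = [100.0, 200.0, 300.0, 400.0, 500.0]
raw_prediction = [0.05, 0.30, 0.28, 0.80, 1.00]

cdf = monotone_cdf(raw_prediction)
mean, sd = summaries(throughput, cdf)
median = quantile(throughput, cdf, 0.5)
```

Placing the mass at grid points is a coarse discretization; a finer grid or interpolated mass would change the mean and standard deviation slightly.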

preprint, 2022, arXiv

The Poisson Multinomial Distribution and Its Applications in Voting Theory, Ecological Inference, and Machine Learning

The Poisson multinomial distribution (PMD) describes the distribution of the sum of $n$ independent but non-identically distributed random vectors, in which each random vector is of length $m$ with 0/1 valued elements and exactly one of its elements takes the value 1, each with a certain probability. Those probabilities differ for the $m$ elements across the $n$ random vectors, and form an $n \times m$ matrix in which each row sums to 1. We call this $n \times m$ matrix the success probability matrix (SPM). Each SPM uniquely defines a PMD. The PMD is useful in many areas such as voting theory, ecological inference, and machine learning. The distribution functions of the PMD, however, are usually difficult to compute. In this paper, we develop efficient methods to compute the probability mass function (pmf) for the PMD using the multivariate Fourier transform, normal approximation, and simulations. We study the accuracy and efficiency of those methods and give recommendations for which methods to use under various scenarios. We also illustrate the use of the PMD via three applications, namely, in voting probability calculation, aggregated data inference, and uncertainty quantification in classification. We build an R package that implements the proposed methods, and illustrate the package with examples.
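The definition above can be made concrete with a brute-force sketch that computes the pmf by adding one random vector at a time. This is plain sequential convolution, not the Fourier-transform, normal-approximation, or simulation methods the abstract describes, and it is only practical for small $n$ and $m$. The function name and the example SPM are hypothetical.

```python
from collections import defaultdict


def pmd_pmf(spm):
    """Exact pmf of the Poisson multinomial distribution by sequential convolution.

    spm: list of n rows, each holding m category probabilities that sum to 1.
    Returns a dict mapping count vectors (tuples of length m) to probabilities.
    """
    m = len(spm[0])
    pmf = {tuple([0] * m): 1.0}
    for row in spm:
        updated = defaultdict(float)
        for counts, prob in pmf.items():
            for j, p in enumerate(row):
                if p > 0.0:
                    # This vector puts its single 1 in category j.
                    bumped = counts[:j] + (counts[j] + 1,) + counts[j + 1:]
                    updated[bumped] += prob * p
        pmf = dict(updated)
    return pmf


# Three voters, three candidates (illustrative probabilities).
spm = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
pmf = pmd_pmf(spm)
unanimous_for_first = pmf[(3, 0, 0)]  # equals 0.6 * 0.2 * 0.1
```

The number of count vectors grows quickly with $n$ and $m$, which is consistent with the abstract's remark that the distribution functions are usually difficult to compute.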

preprint, 2021, arXiv

Reliability Analysis of Artificial Intelligence Systems Using Recurrent Events Data from Autonomous Vehicles

Artificial intelligence (AI) systems have become increasingly common and the trend will continue. Examples of AI systems include autonomous vehicles (AV), computer vision, natural language processing, and AI medical experts. To allow for safe and effective deployment of AI systems, the reliability of such systems needs to be assessed. Traditionally, reliability assessment is based on reliability test data and the subsequent statistical modeling and analysis. The availability of reliability data for AI systems, however, is limited because such data are typically sensitive and proprietary. The California Department of Motor Vehicles (DMV) oversees and regulates an AV testing program, in which many AV manufacturers are conducting AV road tests. Manufacturers participating in the program are required to report recurrent disengagement events to the California DMV. This information is being made available to the public. In this paper, we use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. We use traditional parametric models in software reliability and propose a new nonparametric model based on monotonic splines to describe the event process. We develop inference procedures for selecting the best models, quantifying uncertainty, and testing heterogeneity in the event process. We then analyze the recurrent events data from four AV manufacturers, and make inferences on the reliability of the AI systems in AV. We also describe how the proposed analysis can be applied to assess the reliability of other AI systems.

preprint, 2021, arXiv

Sequential Design of Computer Experiments with Quantitative and Qualitative Factors in Applications to HPC Performance Optimization

Computer experiments with both qualitative and quantitative factors are widely used in many applications. Motivated by the emerging need for optimal configuration of high-performance computing (HPC) systems, this work proposes a sequential design, denoted as adaptive composite exploitation and exploration (CEE), for optimization of computer experiments with qualitative and quantitative factors. The proposed adaptive CEE method combines the predictive mean and standard deviation based on the additive Gaussian process to achieve a meaningful balance between exploitation and exploration for optimization. Moreover, the adaptiveness of the proposed sequential procedure allows the selection of the next design point from the adaptive design region. Theoretical justification of the adaptive design region is provided. The performance of the proposed method is evaluated by several numerical examples in simulations. The case study of HPC performance optimization further elaborates the merits of the proposed method.
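The idea of combining a predictive mean and standard deviation to choose the next run can be sketched with a generic lower-confidence-bound rule. This is a stand-in, not the CEE criterion, whose exact form is not given in the abstract. The candidate configurations, their predicted values, the weight `kappa`, and the assumption that smaller responses are better are all hypothetical.

```python
def pick_next(candidates, kappa=2.0):
    """Return the candidate minimizing mean - kappa * sd.

    A small predicted mean rewards exploitation; a large predicted
    standard deviation rewards exploration. kappa sets the trade-off.
    """
    return min(candidates, key=lambda c: c["mean"] - kappa * c["sd"])


# Hypothetical surrogate predictions at three untried configurations, each
# with one qualitative factor (mode) and one quantitative factor (threads).
candidates = [
    {"mode": "read", "threads": 8, "mean": 10.0, "sd": 0.5},
    {"mode": "write", "threads": 16, "mean": 11.0, "sd": 2.0},
    {"mode": "read", "threads": 32, "mean": 12.0, "sd": 0.1},
]

explore_choice = pick_next(candidates, kappa=2.0)  # uncertainty dominates
exploit_choice = pick_next(candidates, kappa=0.0)  # predicted mean only
```

With `kappa=0` the rule selects purely on the predicted mean; larger values favor configurations the surrogate is unsure about.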

preprint, 2016, arXiv

ADDT: An R Package for Analysis of Accelerated Destructive Degradation Test Data

Accelerated destructive degradation tests (ADDT) are often used to collect necessary data for assessing the long-term properties of polymeric materials. Based on the data, a thermal index (TI) is estimated. The TI can be useful for material rating and comparisons. The R package ADDT implements the traditional method based on least squares, the parametric method based on maximum likelihood estimation, and the semiparametric method based on splines for analyzing ADDT data and estimating the TI for polymeric materials. In this chapter, we provide a detailed introduction to the ADDT package. We provide a step-by-step illustration of the use of functions in the package. Publicly available datasets are used for illustrations.

preprint, 2016, arXiv

Statistical Methods for Thermal Index Estimation Based on Accelerated Destructive Degradation Test Data

Accelerated destructive degradation testing (ADDT) is a technique commonly used in industry to assess a material's long-term properties. In many applications, the accelerating variable is temperature. In such cases, a thermal index (TI) is used to indicate the strength of the material. For example, a TI of 200 °C may be interpreted to mean that the material can be expected to maintain a specific property at a temperature of 200 °C for 100,000 hours. A material with a higher TI possesses a stronger resistance to thermal damage. In the literature, there are three methods available to estimate the TI based on ADDT data: the traditional method based on the least-squares approach, the parametric method, and the semiparametric method. In this chapter, we provide a comprehensive review of the three methods and illustrate how the TI can be estimated based on different models. We also conduct comprehensive simulation studies to show the properties of different methods. We provide thorough discussions of the pros and cons of each method. The comparisons and discussion in this chapter can be useful for practitioners and future industrial standards.
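The 100,000-hour interpretation of the TI can be sketched as a least-squares extrapolation. The example assumes an Arrhenius-type linear relation between the base-10 logarithm of the failure time and the reciprocal of absolute temperature, and it starts from failure times that have already been estimated at each test temperature. The relation, the data, and the function names are illustrative assumptions and not the ADDT package's API.

```python
import math

KELVIN_OFFSET = 273.15


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def thermal_index(temps_c, failure_hours, target_hours=100_000.0):
    """Temperature (deg C) at which the fitted line predicts failure at target_hours.

    Assumed model: log10(t_f) = b0 + b1 / (T + 273.15).
    """
    xs = [1.0 / (t + KELVIN_OFFSET) for t in temps_c]
    ys = [math.log10(h) for h in failure_hours]
    b0, b1 = fit_line(xs, ys)
    return b1 / (math.log10(target_hours) - b0) - KELVIN_OFFSET


# Hypothetical failure times (hours) at three accelerated test temperatures.
temps_c = [250.0, 270.0, 290.0]
failure_hours = [2400.0, 772.0, 269.0]

ti = thermal_index(temps_c, failure_hours)
ti_short_life = thermal_index(temps_c, failure_hours, target_hours=10_000.0)
```

Because the TI lies well below the test temperatures, the estimate is an extrapolation and is sensitive to the assumed temperature relation. A shorter target life yields a higher admissible temperature.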

preprint, 2015, arXiv

Semi-parametric Models for Accelerated Destructive Degradation Test Data Analysis

Accelerated destructive degradation tests (ADDT) are widely used in industry to evaluate materials' long-term properties. Even though there has been tremendous statistical research in nonparametric methods, the current industrial practice is still to use application-specific parametric models to describe ADDT data. The challenge of using a nonparametric approach comes from the need to retain the physical meaning of degradation mechanisms and also perform extrapolation for predictions at the use condition. Motivated by this challenge, we propose a semi-parametric model to describe ADDT data. We use monotonic B-splines to model the degradation path, which not only provides flexible models with few assumptions, but also retains the physical meaning of degradation mechanisms (e.g., the degradation path is monotonically decreasing). Parametric models, such as the Arrhenius model, are used for modeling the relationship between the degradation and the accelerating variable, allowing for extrapolation to the use conditions. We develop an efficient procedure to estimate model parameters. We also use simulation to validate the developed procedures and demonstrate the robustness of the semi-parametric model under model misspecification. Finally, the proposed method is illustrated by multiple industrial applications.

preprint, 2015, arXiv

Survival and lifetime data analysis with a flexible class of distributions

We introduce a general class of continuous univariate distributions with positive support obtained by transforming the class of two-piece distributions. We show that this class of distributions is very flexible, easy to implement, and contains members that can capture different tail behaviours and shapes, also producing a variety of hazard functions. The proposed distributions represent a flexible alternative to classical choices such as the log-normal, Gamma, and Weibull distributions. We investigate empirically the inferential properties of the proposed models through an extensive simulation study. We present some applications using real data in the contexts of time-to-event and accelerated failure time models. In the second kind of application, we explore the use of these models in the estimation of the distribution of the individual remaining life.

preprint, 2014, arXiv

How do heterogeneities in operating environments affect field failure predictions and test planning?

The main objective of accelerated life tests (ALTs) is to predict the fraction of products failing in the field. However, there are often discrepancies between the fraction failing predicted from lab testing data and that observed in field failure data, due to unobserved heterogeneities in usage and operating conditions. Most previous research on ALT planning and data analysis ignores these discrepancies, resulting in inferior test plans and biased predictions. In this paper, we model the heterogeneous environments together with their effects on product failures as a frailty term that links the lab failure time distribution and the field failure time distribution of a product. We show that in the presence of heterogeneous operating conditions, the hazard rate function of the field failure time distribution exhibits a range of shapes. A statistical inference procedure for the frailty models is developed for the case in which both the ALT data and the field failure data are available. Based on the frailty models, optimal ALT plans aimed at predicting the field failure time distribution are obtained. The developed methods are demonstrated through a real-life example.
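The way a frailty term reshapes the field hazard can be sketched under two illustrative assumptions that the abstract does not state: a Weibull lab failure time distribution and a gamma-distributed frailty with mean 1 and variance `theta`. Under those assumptions the field survival function has the closed form $S(t) = (1 + \theta H_0(t))^{-1/\theta}$, where $H_0$ is the lab cumulative hazard. All function names and parameter values are hypothetical.

```python
def weibull_cum_hazard(t, shape, scale):
    """Lab cumulative hazard H0(t) = (t / scale) ** shape."""
    return (t / scale) ** shape


def weibull_hazard(t, shape, scale):
    """Lab hazard h0(t), the derivative of H0(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def field_survival(t, shape, scale, theta):
    """Field survival after averaging over a gamma frailty (mean 1, variance theta)."""
    return (1.0 + theta * weibull_cum_hazard(t, shape, scale)) ** (-1.0 / theta)


def field_hazard(t, shape, scale, theta):
    """Field hazard: h0(t) / (1 + theta * H0(t))."""
    return weibull_hazard(t, shape, scale) / (
        1.0 + theta * weibull_cum_hazard(t, shape, scale)
    )


# Illustrative parameters: increasing lab hazard (shape > 1), strong heterogeneity.
shape, scale, theta = 2.0, 1.0, 2.0
early = field_hazard(0.2, shape, scale, theta)
middle = field_hazard(0.7, shape, scale, theta)
late = field_hazard(5.0, shape, scale, theta)
```

Although the lab hazard is strictly increasing here, the field hazard rises and then falls, because units in harsh environments fail early and the surviving population is increasingly made up of units in mild environments.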
