Source author record

Yutao Ma

Yutao Ma appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Software Engineering, math.PR, Computer Vision, Information Retrieval, Machine Learning, Social and Information Networks, eess.IV

Catalog footprint

What is connected

19 works

7 topics

4 close collaborators

preprint · 2026 · arXiv

Deviation probabilities and Sharp Berry-Esseen bound for rightmost eigenvalue of large non-Hermitian chiral random matrices

This paper provides a quantitative analysis of the rightmost eigenvalue for a chiral non-Hermitian random Dirac matrix in the maximally non-Hermitian regime ($τ=0$). Let $(σ_i)_{1\le i\le n}$ be the eigenvalues with positive real part. We define the normalization constants \[ s_n = \frac{4n(n+v)}{2n+v}, \qquad γ_n = \frac{1}{2}\log s_n - \frac{5}{4}\log(\log s_n) - \log\bigl(2^{1/4}π\bigr), \] and the centered and scaled variable \[ X_n = \sqrt{2s_n\log s_n}\,\bigl(\bigl(\tfrac{n}{n+v}\bigr)^{1/4}\,\max_{1\le i\le n}\Re σ_i \;-\; 1 \;-\; \frac{γ_n}{\sqrt{2s_n\log s_n}}\bigr). \] Our main result is the following sharp Berry--Esseen bound for the convergence of $X_n$ to the Gumbel distribution: \[ \sup_{x \in \mathbb{R}} \bigl|\mathbb{P}(X_n \le x) - e^{-e^{-x}}\bigr| = \frac{25 (\log\log s_n)^2}{16 e \,\log s_n}\,\bigl(1 + o(1)\bigr), \] which holds as $n \to \infty$ for an arbitrary parameter $v \ge 0$ (which may depend on $n$). As a byproduct of our analysis, we also obtain precise large- and moderate-deviation principles for the scaled rightmost eigenvalue $\bigl(\frac{n}{n+v}\bigr)^{1/4} \max_{1\le i\le n}\Re σ_i$, characterizing its rate of convergence to the value $1$.
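The normalization in this abstract can be evaluated numerically. The following is a minimal sketch, not the authors' code: it only transcribes the formulas for $s_n$, $γ_n$, $X_n$ and the leading-order error term as stated above. The function names (`s_n`, `gamma_n`, `rescale`, `gumbel_cdf`, `leading_error`) are illustrative assumptions, and the `1 + o(1)` factor is ignored.

```python
import math


def s_n(n: int, v: float) -> float:
    """Normalization constant s_n = 4n(n+v) / (2n+v)."""
    return 4.0 * n * (n + v) / (2.0 * n + v)


def gamma_n(s: float) -> float:
    """Centering term: (1/2) log s - (5/4) log(log s) - log(2^(1/4) * pi)."""
    return (0.5 * math.log(s)
            - 1.25 * math.log(math.log(s))
            - math.log(2 ** 0.25 * math.pi))


def rescale(max_re: float, n: int, v: float) -> float:
    """Map the largest real part of the eigenvalues to X_n as defined in the abstract."""
    s = s_n(n, v)
    a = math.sqrt(2.0 * s * math.log(s))
    scaled = (n / (n + v)) ** 0.25 * max_re
    return a * (scaled - 1.0 - gamma_n(s) / a)


def gumbel_cdf(x: float) -> float:
    """Limiting Gumbel distribution function exp(-exp(-x))."""
    return math.exp(-math.exp(-x))


def leading_error(n: int, v: float) -> float:
    """Leading term 25 (log log s_n)^2 / (16 e log s_n) of the Berry-Esseen bound."""
    s = s_n(n, v)
    return 25.0 * math.log(math.log(s)) ** 2 / (16.0 * math.e * math.log(s))


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(n, s_n(n, 0.0), leading_error(n, 0.0))
```

Evaluating `leading_error` for growing `n` illustrates how slowly a rate of order $(\log\log s_n)^2/\log s_n$ decays.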

preprint · 2022 · arXiv

Cervical Optical Coherence Tomography Image Classification Based on Contrastive Self-Supervised Texture Learning

Background: Cervical cancer seriously affects the health of the female reproductive system. Optical coherence tomography (OCT) has emerged as a non-invasive, high-resolution imaging technology for cervical disease detection. However, OCT image annotation is knowledge-intensive and time-consuming, which impedes the training process of deep-learning-based classification models. Purpose: This study aims to develop a computer-aided diagnosis (CADx) approach to classifying in-vivo cervical OCT images based on self-supervised learning. Methods: In addition to high-level semantic features extracted by a convolutional neural network (CNN), the proposed CADx approach leverages unlabeled cervical OCT images' texture features learned by contrastive texture learning. We conducted ten-fold cross-validation on the OCT image dataset from a multi-center clinical study on 733 patients from China. Results: In a binary classification task for detecting high-risk diseases, including high-grade squamous intraepithelial lesion and cervical cancer, our method achieved an area-under-the-curve value of 0.9798 ± 0.0157 with a sensitivity of 91.17 ± 4.99% and a specificity of 93.96 ± 4.72% for OCT image patches; also, it outperformed two out of four medical experts on the test set. Furthermore, our method achieved a 91.53% sensitivity and 97.37% specificity on an external validation dataset containing 287 3D OCT volumes from 118 Chinese patients in a new hospital using a cross-shaped threshold voting strategy. Conclusions: The proposed contrastive-learning-based CADx method outperformed the end-to-end CNN models and provided better interpretability based on texture features, which holds great potential to be used in the clinical protocol of "see-and-treat."

preprint · 2022 · arXiv

Deep Learning Framework for Multi-Round Service Bundle Recommendation in Iterative Mashup Development

Recent years have witnessed the rapid development of service-oriented computing technologies. The boom of Web services increases software developers' selection burden in developing new service-based systems such as mashups. Timely recommending appropriate component services for developers to build new mashups has become a fundamental problem in service-oriented software engineering. Existing service recommendation approaches are mainly designed for mashup development in the single-round scenario. It is hard for them to effectively update recommendation results according to developers' requirements and behaviours (e.g. instant service selection). To address this issue, the authors propose a service bundle recommendation framework based on deep learning, DLISR, which aims to capture the interactions among the target mashup to build, selected (component) services, and the following service to recommend. Moreover, an attention mechanism is employed in DLISR to weigh selected services when recommending a candidate service. The authors also design two separate models for learning interactions from the perspectives of content and invocation history, respectively, and a hybrid model called HISR. Experiments on a real-world dataset indicate that HISR can outperform several state-of-the-art service recommendation methods to develop new mashups iteratively.

preprint · 2022 · arXiv

Position-enhanced and Time-aware Graph Convolutional Network for Sequential Recommendations

Most of the existing deep learning-based sequential recommendation approaches utilize the recurrent neural network architecture or self-attention to model the sequential patterns and temporal influence among a user's historical behavior and learn the user's preference at a specific time. However, these methods have two main drawbacks. First, they focus on modeling users' dynamic states from a user-centric perspective and always neglect the dynamics of items over time. Second, most of them deal with only the first-order user-item interactions and do not consider the high-order connectivity between users and items, which has recently been proved helpful for the sequential recommendation. To address the above problems, in this article, we attempt to model user-item interactions by a bipartite graph structure and propose a new recommendation approach based on a Position-enhanced and Time-aware Graph Convolutional Network (PTGCN) for the sequential recommendation. PTGCN models the sequential patterns and temporal dynamics between user-item interactions by defining a position-enhanced and time-aware graph convolution operation and learning the dynamic representations of users and items simultaneously on the bipartite graph with a self-attention aggregator. Also, it realizes the high-order connectivity between users and items by stacking multi-layer graph convolutions. To demonstrate the effectiveness of PTGCN, we carried out a comprehensive evaluation of PTGCN on three real-world datasets of different sizes compared with a few competitive baselines. Experimental results indicate that PTGCN outperforms several state-of-the-art models in terms of two commonly-used evaluation metrics for ranking.

preprint · 2021 · arXiv

A Spatial-Temporal Graph Neural Network Framework for Automated Software Bug Triaging

The bug triaging process, an essential process of assigning bug reports to the most appropriate developers, is related closely to the quality and costs of software development. As manual bug assignment is a labor-intensive task, especially for large-scale software projects, many machine-learning-based approaches have been proposed to automatically triage bug reports. Although developer collaboration networks (DCNs) are dynamic and evolving in the real world, most automated bug triaging approaches focus on static tossing graphs at a single time slice. Also, none of the previous studies consider periodic interactions among developers. To address the problems mentioned above, in this article, we propose a novel spatial-temporal dynamic graph neural network (ST-DGNN) framework, including a joint random walk (JRWalk) mechanism and a graph recurrent convolutional neural network (GRCNN) model. In particular, JRWalk aims to sample local topological structures in a graph with two sampling strategies by considering both node importance and edge importance. GRCNN has three components with the same structure, i.e., hourly-periodic, daily-periodic, and weekly-periodic components, to learn the spatial-temporal features of dynamic DCNs. We evaluated our approach's effectiveness by comparing it with several state-of-the-art graph representation learning methods in two domain-specific tasks that belong to node classification. In the two tasks, experiments on two real-world, large-scale developer collaboration networks collected from the Eclipse and Mozilla projects indicate that the proposed approach outperforms all the baseline methods.

preprint · 2020 · arXiv

DAN-SNR: A Deep Attentive Network for Social-Aware Next Point-of-Interest Recommendation

Next (or successive) point-of-interest (POI) recommendation has attracted increasing attention in recent years. Most of the previous studies attempted to incorporate the spatiotemporal information and sequential patterns of user check-ins into recommendation models to predict the target user's next move. However, none of these approaches utilized the social influence of each user's friends. In this study, we discuss a new topic of next POI recommendation and present a deep attentive network for social-aware next POI recommendation called DAN-SNR. In particular, the DAN-SNR makes use of the self-attention mechanism instead of the architecture of recurrent neural networks to model sequential influence and social influence in a unified manner. Moreover, we design and implement two parallel channels to capture short-term user preference and long-term user preference as well as social influence, respectively. By leveraging multi-head self-attention, the DAN-SNR can model long-range dependencies between any two historical check-ins efficiently and weigh their contributions to the next destination adaptively. Also, we carried out a comprehensive evaluation using large-scale real-world datasets collected from two popular location-based social networks, namely Gowalla and Brightkite. Experimental results indicate that the DAN-SNR outperforms seven competitive baseline approaches regarding recommendation performance and is of high efficiency among six neural-network- and attention-based methods.
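The self-attention mechanism mentioned in this abstract can be illustrated with a small sketch. This is plain scaled dot-product self-attention over a sequence of check-in embeddings, written with the standard library only; it is not the DAN-SNR architecture. As simplifying assumptions, queries, keys and values are all the raw input vectors (no learned projections, no multiple heads, no social channel), and the names `softmax` and `self_attention` are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with Q = K = V = seq (assumption)."""
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this check-in to every check-in in the history.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


if __name__ == "__main__":
    history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy check-in embeddings
    for row in self_attention(history):
        print(row)
```

Because every position attends to every other position directly, the distance between two check-ins in the sequence does not limit how strongly one can influence the other, which is the property the abstract refers to as modeling long-range dependencies.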

preprint · 2020 · arXiv

On Stein's factors for Poisson approximation in Wasserstein distance with non-linear transportation costs

We establish various bounds on the solutions to a Stein equation for Poisson approximation in Wasserstein distance with non-linear transportation costs. The proofs are a refinement of those in [Barbour and Xia (2006)] using the results in [Liu and Ma (2009)]. As a corollary, we obtain an estimate of Poisson approximation error measured in L^2-Wasserstein distance.

preprint · 2019 · arXiv

Computer-aided diagnosis in histopathological images of the endometrium using a convolutional neural network and attention mechanisms

Uterine cancer, also known as endometrial cancer, can seriously affect the female reproductive organs, and histopathological image analysis is the gold standard for diagnosing endometrial cancer. However, due to the limited capability of modeling the complicated relationships between histopathological images and their interpretations, these computer-aided diagnosis (CADx) approaches based on traditional machine learning algorithms often failed to achieve satisfying results. In this study, we developed a CADx approach using a convolutional neural network (CNN) and attention mechanisms, called HIENet. Because HIENet used the attention mechanisms and feature map visualization techniques, it can provide pathologists better interpretability of diagnoses by highlighting the histopathological correlations of local (pixel-level) image features to morphological characteristics of endometrial tissue. In the ten-fold cross-validation process, the CADx approach, HIENet, achieved a 76.91 $\pm$ 1.17% (mean $\pm$ s. d.) classification accuracy for four classes of endometrial tissue, namely normal endometrium, endometrial polyp, endometrial hyperplasia, and endometrial adenocarcinoma. Also, HIENet achieved an area-under-the-curve (AUC) of 0.9579 $\pm$ 0.0103 with an 81.04 $\pm$ 3.87% sensitivity and 94.78 $\pm$ 0.87% specificity in a binary classification task that detected endometrioid adenocarcinoma (Malignant). Besides, in the external validation process, HIENet achieved an 84.50% accuracy in the four-class classification task, and it achieved an AUC of 0.9829 with a 77.97% (95% CI, 65.27%-87.71%) sensitivity and 100% (95% CI, 97.42%-100.00%) specificity. In summary, the proposed CADx approach, HIENet, outperformed three human experts and four end-to-end CNN-based classifiers on this small-scale dataset composed of 3,500 hematoxylin and eosin (H&E) images regarding overall classification performance.

preprint · 2018 · arXiv

Computer-Aided Diagnosis of Label-Free 3-D Optical Coherence Microscopy Images of Human Cervical Tissue

Objective: Ultrahigh-resolution optical coherence microscopy (OCM) has recently demonstrated its potential for accurate diagnosis of human cervical diseases. One major challenge for clinical adoption, however, is the steep learning curve clinicians need to overcome to interpret OCM images. Developing an intelligent technique for computer-aided diagnosis (CADx) to accurately interpret OCM images will facilitate clinical adoption of the technology and improve patient care. Methods: 497 high-resolution 3-D OCM volumes (600 cross-sectional images each) were collected from 159 ex vivo specimens of 92 female patients. OCM image features were extracted using a convolutional neural network (CNN) model, concatenated with patient information (e.g., age, HPV results), and classified using a support vector machine classifier. Ten-fold cross-validations were utilized to test the performance of the CADx method in a five-class classification task and a binary classification task. Results: An 88.3 ± 4.9% classification accuracy was achieved for five fine-grained classes of cervical tissue, namely normal, ectropion, low-grade and high-grade squamous intraepithelial lesions (LSIL and HSIL), and cancer. In the binary classification task (low-risk [normal, ectropion and LSIL] vs. high-risk [HSIL and cancer]), the CADx method achieved an area-under-the-curve (AUC) value of 0.959 with an 86.7 ± 11.4% sensitivity and 93.5 ± 3.8% specificity. Conclusion: The proposed deep-learning-based CADx method outperformed three human experts. It was also able to identify morphological characteristics in OCM images that were consistent with histopathological interpretations. Significance: Label-free OCM imaging, combined with deep-learning-based CADx methods, holds great promise for use in clinical settings for the effective screening and diagnosis of cervical diseases.

preprint · 2016 · arXiv

TDSelector: A Training Data Selection Method for Cross-Project Defect Prediction

In recent years, cross-project defect prediction (CPDP) has attracted much attention and has been validated as a feasible way to address the problem of local data sparsity in newly created or inactive software projects. Unfortunately, the performance of CPDP is usually poor, and low-quality training data selection has been regarded as a major obstacle to achieving better prediction results. To the best of our knowledge, most existing approaches related to this topic are only based on instance similarity. Therefore, the objective of this work is to propose an improved training data selection method for CPDP that considers both similarity and the number of defects each training instance has (denoted by defects), which is referred to as TDSelector, and to demonstrate the effectiveness of the proposed method. Our experiments were conducted on 14 projects (including 15 data sets) collected from two public repositories. The results indicate that, in a specific CPDP scenario, the TDSelector-based bug predictor performs, on average, better than those based on the baseline methods, and the AUC (area under ROC curve) values are increased by up to 10.6% and 4.3%, respectively. Besides, an additional experiment shows that selecting those instances with more bugs directly as training data can further improve the performance of the bug predictor trained by our method.
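The idea of ranking candidate training instances by both similarity and defect count can be sketched as follows. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the linear weighting with `alpha`, the min-max normalization of defect counts, the precomputed similarity values, and the function name `select_training_data`. It should not be read as the TDSelector method itself.

```python
from typing import List, Sequence


def select_training_data(
    similarities: Sequence[float],
    defects: Sequence[int],
    alpha: float = 0.5,
    top_k: int = 2,
) -> List[int]:
    """Return indices of the top_k instances by a hypothetical combined score.

    score_i = alpha * similarity_i + (1 - alpha) * normalized_defects_i
    (assumed form; the source only says both factors are considered).
    """
    lo, hi = min(defects), max(defects)
    span = hi - lo
    # Min-max normalize defect counts to [0, 1]; all-equal counts map to 0.
    norm = [(d - lo) / span if span else 0.0 for d in defects]
    scores = [alpha * s + (1.0 - alpha) * d for s, d in zip(similarities, norm)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    sims = [0.9, 0.5, 0.8]  # similarity of each candidate to the target project
    bugs = [0, 4, 2]        # number of defects recorded for each candidate
    print(select_training_data(sims, bugs, alpha=0.5, top_k=2))
```

Setting `alpha=1.0` reduces the sketch to similarity-only selection, the baseline behavior the abstract attributes to most earlier approaches.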

preprint · 2015 · arXiv

A Hybrid Approach to Web Service Recommendation Based on QoS-Aware Rating and Ranking

As the number of Web services with the same or similar functions increases steadily on the Internet, nowadays more and more service consumers pay great attention to the non-functional properties of Web services, also known as quality of service (QoS), when finding and selecting appropriate Web services. For most of the QoS-aware Web service recommendation systems, the list of recommended Web services is generally obtained based on a rating-oriented prediction approach, aiming at predicting the potential ratings that an active user may assign to the unrated services as accurately as possible. However, in some application scenarios, high accuracy of rating prediction may not necessarily lead to a satisfactory recommendation result. In this paper, we propose a ranking-oriented hybrid approach by combining the item-based collaborative filtering and latent factor models to address the problem of Web services ranking. In particular, the similarity between two Web services is measured in terms of the correlation coefficient between their rankings instead of between the traditional QoS ratings. Besides, we also improve the measure NDCG (Normalized Discounted Cumulative Gain) for evaluating the accuracy of the top K recommendations returned in ranked order. Comprehensive experiments on the QoS data set composed of real-world Web services are conducted to test our approach, and the experimental results demonstrate that our approach outperforms other competing approaches.
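For reference, the standard NDCG measure that this abstract builds on can be computed as below. This is a minimal sketch of the usual definition with linear gain and a log2 position discount; it is not the improved variant the paper proposes, whose details are not given here. The names `dcg_at_k` and `ndcg_at_k` and the toy relevance values are illustrative assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k items, in the order given."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the predicted order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance (e.g. observed QoS quality) of services in recommended order.
    predicted_order = [2.0, 3.0, 1.0]
    print(ndcg_at_k(predicted_order, k=3))
```

A score of 1 means the recommended order matches the ideal order on the top K positions; swapping highly relevant services toward the bottom lowers the score because later positions are discounted more heavily.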

preprint · 2015 · arXiv

Log-Sobolev, isoperimetry and transport inequalities on graphs

In this paper, we study some functional inequalities (such as Poincaré inequalities, logarithmic Sobolev inequalities, generalized Cheeger isoperimetric inequalities, transportation-information inequalities and transportation-entropy inequalities) for reversible nearest-neighbor Markov processes on a connected finite graph by means of (random) path method. We provide estimates of the involved constants.

preprint · 2014 · arXiv

A note on spectral gap and weighted Poincaré inequalities for some one-dimensional diffusions

We present some classical and weighted Poincaré inequalities for some one-dimensional probability measures. This work is the one-dimensional counterpart of a recent study by the authors for a class of spherically symmetric probability measures in dimension larger than 2. Our strategy is based on two main ingredients: on the one hand, the optimal constant in the desired weighted Poincaré inequality has to be rewritten as the spectral gap of a convenient Markovian diffusion operator, and on the other hand we use a recent result given by the first two authors, which allows us to estimate this spectral gap precisely. In particular, we are able to capture its exact value for some examples.

preprint · 2014 · arXiv

An Analysis of Research in Software Engineering: Assessment and Trends

Glass published the first report on the assessment of systems and software engineering scholars and institutions two decades ago. The ongoing, annual survey of publications in this field provides fund managers, young scholars, graduate students, etc. with useful information for different purposes. However, the studies have been questioned by some critics because of a few shortcomings of the evaluation method. It is actually very hard to reach a widely recognized consensus on such an assessment of scholars and institutions. This paper presents a module and automated method for assessment and trends analysis in software engineering compared with the prior studies. To achieve a more reasonable evaluation result, we take into consideration more high-quality publications, the rank of each publication analyzed, and the different roles of authors named on each paper in question. According to the 7638 papers published in 36 publications from 2008 to 2013, the statistics of research subjects roughly follow power laws, implying the interesting Matthew Effect. We then identify the Top 20 scholars, institutions and countries or regions in terms of a new evaluation rule based on the frequently-used one. The top-ranked scholar is Mark Harman of the University College London, UK, the top-ranked institution is the University of California, USA, and the top-ranked country is the USA. Besides, we also show two levels of trend changes based on the EI classification system and user-defined uncontrolled keywords, as well as noteworthy scholars and institutions in a specific research area. We believe that our results would provide a valuable insight for young scholars and graduate students to seek possible potential collaborators and grasp the popular research topics in software engineering.

preprint · 2014 · arXiv

An Empirical Study on Software Defect Prediction with a Simplified Metric Set

Software defect prediction plays a crucial role in estimating the most defect-prone components of software, and a large number of studies have pursued improving prediction accuracy within a project or across projects. However, the rules for making an appropriate decision between within- and cross-project defect prediction when available historical data are insufficient remain unclear. The objective of this work is to validate the feasibility of the predictor built with a simplified metric set for software defect prediction in different scenarios, and to investigate practical guidelines for the choice of training data, classifier and metric subset of a given project. First, based on six typical classifiers, we constructed three types of predictors using the size of software metric set in three scenarios. Then, we validated the acceptable performance of the predictor based on Top-k metrics in terms of statistical methods. Finally, we attempted to minimize the Top-k metric subset by removing redundant metrics, and we tested the stability of such a minimum metric subset with one-way ANOVA tests. The experimental results indicate that (1) the choice of training data should depend on the specific requirement of prediction accuracy; (2) the predictor built with a simplified metric set works well and is very useful in case limited resources are supplied; (3) simple classifiers (e.g., Naive Bayes) also tend to perform well when using a simplified metric set for defect prediction; and (4) in several cases, the minimum metric subset can be identified to facilitate the procedure of general defect prediction with acceptable loss of prediction precision in practice. The guideline for choosing a suitable simplified metric set in different scenarios is presented in Table 12.

preprint · 2014 · arXiv

Simplification of Training Data for Cross-Project Defect Prediction

Cross-project defect prediction (CPDP) plays an important role in estimating the most likely defect-prone software components, especially for new or inactive projects. To the best of our knowledge, few prior studies provide explicit guidelines on how to select suitable training data of quality from a large number of public software repositories. In this paper, we have proposed a training data simplification method for practical CPDP in consideration of multiple levels of granularity and filtering strategies for data sets. In addition, we have also provided quantitative evidence on the selection of a suitable filter in terms of defect-proneness ratio. Based on an empirical study on 34 releases of 10 open-source projects, we have elaborately compared the prediction performance of different defect predictors built with five well-known classifiers using training data simplified at different levels of granularity and with two popular filters. The results indicate that when using the multi-granularity simplification method with an appropriate filter, the prediction models based on Naive Bayes can achieve fairly good performance and outperform the benchmark method.

preprint · 2014 · arXiv

Spectral gap for spherically symmetric log-concave probability measures, and beyond

Let $μ$ be a probability measure on $\mathbb{R}^n$ ($n \geq 2$) with Lebesgue density proportional to $e^{-V(\Vert x\Vert)}$, where $V : \mathbb{R}_+ \to \mathbb{R}$ is a smooth convex potential. We show that the associated spectral gap in $L^2(μ)$ lies between $(n-1) / \int_{\mathbb{R}^n} \Vert x\Vert^2 μ(dx)$ and $n / \int_{\mathbb{R}^n} \Vert x\Vert^2 μ(dx)$, improving a well-known two-sided estimate due to Bobkov. Our Markovian approach is remarkably simple and is sufficiently robust to be extended beyond the log-concave case, at the price of potentially modifying the underlying dynamics in the energy, leading to weighted Poincaré inequalities. All our results are illustrated by some classical and less classical examples.
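As a quick sanity check of the two-sided estimate, consider the standard Gaussian measure, which fits the setting with $V(r) = r^2/2$. This worked example is not taken from the paper; the symbol $\lambda_1$ for the spectral gap is notation assumed here.

```latex
% Standard Gaussian on R^n: density proportional to exp(-|x|^2 / 2).
\[
  \int_{\mathbb{R}^n} \Vert x\Vert^2 \, \mu(dx) = n
  \quad\Longrightarrow\quad
  \frac{n-1}{n} \;\le\; \lambda_1 \;\le\; \frac{n}{n} = 1 .
\]
% The Gaussian Poincare inequality holds with optimal constant 1, so
\[
  \lambda_1 = 1 ,
\]
% i.e. the upper bound is attained, and the lower bound (n-1)/n tends to 1 as n grows.
```

In this case the two bounds differ only by $1/n$, so the estimate becomes tight in high dimension.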

preprint · 2014 · arXiv

Towards Cross-Project Defect Prediction with Imbalanced Feature Sets

Cross-project defect prediction (CPDP) has been deemed as an emerging technology of software quality assurance, especially in new or inactive projects, and a few improved methods have been proposed to support better defect prediction. However, the regular CPDP always assumes that the features of training and test data are all identical. Hence, very little is known about whether the method for CPDP with imbalanced feature sets (CPDP-IFS) works well. Considering the diversity of defect data sets available on the Internet as well as the high cost of labeling data, to address the issue, in this paper we proposed a simple approach according to a distribution characteristic-based instance (object class) mapping, and demonstrated the validity of our method based on three public defect data sets (i.e., PROMISE, ReLink and AEEEM). Besides, the empirical results indicate that the hybrid model composed of CPDP and CPDP-IFS does improve the prediction performance of the regular CPDP to some extent.

preprint · 2013 · arXiv

Dynamics of Open-Source Software Developer's Commit Behavior: An Empirical Investigation of Subversion

Commit is an important operation of revision control for open-source software (OSS). Recent research has explored the statistical laws of such an operation, but few of those papers conduct empirical investigations on commit interval (i.e., the waiting time between two consecutive commits). In this paper, we investigated software developers' collective and individual commit behavior in terms of the distribution of commit intervals, and found that 1) the data sets of project-level commit interval within both the lifecycle and each release of the projects analyzed roughly follow power-law distributions; and 2) lifecycle- and release-level collective commit interval on class files can also be best fitted with power laws. These findings reveal some general (collective) collaborative development patterns of OSS projects, e.g., most of the waiting times between two consecutive commits to a central repository are short, and only a few are long. We then outline the implications of these findings for OSS research, which could provide insight into understanding OSS development processes better based on software developers' historical commit behavior.
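The commit-interval analysis described in this abstract can be sketched in a few lines. The abstract does not say how the power laws were fitted, so the estimator below is an assumption: the standard continuous maximum-likelihood (Hill-type) estimate of the exponent above a hand-picked cutoff `x_min`. The function names and the toy timestamps are likewise illustrative.

```python
import math
from typing import List, Sequence


def commit_intervals(timestamps: Sequence[float]) -> List[float]:
    """Waiting times between consecutive commits (zero-length gaps dropped)."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:]) if b > a]


def powerlaw_alpha_mle(intervals: Sequence[float], x_min: float) -> float:
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min)) for x_i >= x_min."""
    tail = [x for x in intervals if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one interval strictly above x_min")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Toy commit times in seconds since the first commit (made-up data).
    times = [0, 60, 90, 400, 460, 5000, 5030, 90000]
    gaps = commit_intervals(times)
    print(gaps)
    print(powerlaw_alpha_mle(gaps, x_min=30.0))
```

A serious fit would also select `x_min` from the data and test goodness of fit against alternatives such as a log-normal distribution; the sketch only shows how an exponent is obtained from raw commit timestamps.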
