Source author record

Xintao Wu

Xintao Wu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Artificial Intelligence Cryptography and Security Computation and Language cs.CY Databases Genomics Information Theory math.IT physics.soc-ph Social and Information Networks

Catalog footprint

What is connected

12works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Histopathology-centered Computational Evolution of Spatial Omics: Integration, Mapping, and Foundation Models

Spatial omics (SO) technologies enable spatially resolved molecular profiling, while hematoxylin and eosin (H&E) imaging remains the gold standard for morphological assessment in clinical pathology. Recent computational advances increasingly place H&E images at the center of SO analysis, bridging morphology with transcriptomic, proteomic, and other spatial molecular modalities, and pushing resolution toward the single-cell level. In this survey, we systematically review the computational evolution of SO from a histopathology-centered perspective and organize existing methods into three paradigms: integration, which jointly models paired multimodal data; mapping, which infers molecular profiles from H&E images; and foundation models, which learn generalizable representations from large-scale spatial datasets. We analyze how the role of H&E images evolves across these paradigms from spatial context to predictive anchor and ultimately to representation backbone in response to practical constraints such as limited paired data and increasing resolution demands. We further summarize actionable modeling directions enabled by current architectures and delineate persistent gaps driven by data, biology, and technology that are unlikely to be resolved by model design alone. Together, this survey provides a histopathology-centered roadmap for developing and applying computational frameworks in SO.

preprint2026arXiv

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.

preprint2023arXiv

InfoFair: Information-Theoretic Intersectional Fairness

Algorithmic fairness is becoming increasingly important in data mining and machine learning. Among others, a foundational notation is group fairness. The vast majority of the existing works on group fairness, with a few exceptions, primarily focus on debiasing with respect to a single sensitive attribute, despite the fact that the co-existence of multiple sensitive attributes (e.g., gender, race, marital status, etc.) in the real-world is commonplace. As such, methods that can ensure a fair learning outcome with respect to all sensitive attributes of concern simultaneously need to be developed. In this paper, we study the problem of information-theoretic intersectional fairness (InfoFair), where statistical parity, a representative group fairness measure, is guaranteed among demographic groups formed by multiple sensitive attributes of interest. We formulate it as a mutual information minimization problem and propose a generic end-to-end algorithmic framework to solve it. The key idea is to leverage a variational representation of mutual information, which considers the variational distribution between learning outcomes and sensitive attributes, as well as the density ratio between the variational and the original distributions. Our proposed framework is generalizable to many different settings, including other statistical notions of fairness, and could handle any type of learning task equipped with a gradient-based optimizer. Empirical evaluations in the fair classification task on three real-world datasets demonstrate that our proposed framework can effectively debias the classification results with minimal impact to the classification accuracy.

preprint2022arXiv

Adaptive Fairness-Aware Online Meta-Learning for Changing Environments

The fairness-aware online learning framework has arisen as a powerful tool for the continual lifelong learning setting. The goal for the learner is to sequentially learn new tasks where they come one after another over time and the learner ensures the statistic parity of the new coming task across different protected sub-populations (e.g. race and gender). A major drawback of existing methods is that they make heavy use of the i.i.d assumption for data and hence provide static regret analysis for the framework. However, low static regret cannot imply a good performance in changing environments where tasks are sampled from heterogeneous distributions. To address the fairness-aware online learning problem in changing environments, in this paper, we first construct a novel regret metric FairSAR by adding long-term fairness constraints onto a strongly adapted loss regret. Furthermore, to determine a good model parameter at each round, we propose a novel adaptive fairness-aware online meta-learning algorithm, namely FairSAOML, which is able to adapt to changing environments in both bias control and model precision. The problem is formulated in the form of a bi-level convex-concave optimization with respect to the model's primal and dual parameters that are associated with the model's accuracy and fairness, respectively. The theoretic analysis provides sub-linear upper bounds for both loss regret and violation of cumulative fairness constraints. Our experimental evaluation on different real-world datasets with settings of changing environments suggests that the proposed FairSAOML significantly outperforms alternatives based on the best prior online learning approaches.

preprint2022arXiv

The Fairness Field Guide: Perspectives from Social and Formal Sciences

Over the past several years, a slew of different methods to measure the fairness of a machine learning model have been proposed. However, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of fair machine learning with the social sciences of philosophy, sociology, and law. We hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of fair machine learning produced by both social and formal (specifically machine learning and statistics) sciences in this field guide. Specifically, in addition to giving the mathematical and algorithmic backgrounds of several popular statistical and causal-based fair machine learning methods, we explain the underlying philosophical and legal thoughts that support them. Further, we explore several criticisms of the current approaches to fair machine learning from sociological and philosophical viewpoints. It is our hope that this field guide will help fair machine learning practitioners better understand how their algorithms align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces.

preprint2022arXiv

Trustworthy Anomaly Detection: A Survey

Anomaly detection has a wide range of real-world applications, such as bank fraud detection and cyber intrusion detection. In the past decade, a variety of anomaly detection models have been developed, which lead to big progress towards accurately detecting various anomalies. Despite the successes, anomaly detection models still face many limitations. The most significant one is whether we can trust the detection results from the models. In recent years, the research community has spent a great effort to design trustworthy machine learning models, such as developing trustworthy classification models. However, the attention to anomaly detection tasks is far from sufficient. Considering that many anomaly detection tasks are life-changing tasks involving human beings, labeling someone as anomalies or fraudsters should be extremely cautious. Hence, ensuring the anomaly detection models conducted in a trustworthy fashion is an essential requirement to deploy the models to conduct automatic decisions in the real world. In this brief survey, we summarize the existing efforts and discuss open problems towards trustworthy anomaly detection from the perspectives of interpretability, fairness, robustness, and privacy-preservation.

preprint2021arXiv

LogBERT: Log Anomaly Detection via BERT

Detecting anomalous events in online computer systems is crucial to protect the systems from malicious attacks or malfunctions. System logs, which record detailed information of computational events, are widely used for system status analysis. In this paper, we propose LogBERT, a self-supervised framework for log anomaly detection based on Bidirectional Encoder Representations from Transformers (BERT). LogBERT learns the patterns of normal log sequences by two novel self-supervised training tasks and is able to detect anomalies where the underlying patterns deviate from normal log sequences. The experimental results on three log datasets show that LogBERT outperforms state-of-the-art approaches for anomaly detection.

preprint2020arXiv

Deep Learning for Insider Threat Detection: Review, Challenges and Opportunities

Insider threats, as one type of the most challenging threats in cyberspace, usually cause significant loss to organizations. While the problem of insider threat detection has been studied for a long time in both security and data mining communities, the traditional machine learning based detection approaches, which heavily rely on feature engineering, are hard to accurately capture the behavior difference between insiders and normal users due to various challenges related to the characteristics of underlying data, such as high-dimensionality, complexity, heterogeneity, sparsity, lack of labeled insider threats, and the subtle and adaptive nature of insider threats. Advanced deep learning techniques provide a new paradigm to learn end-to-end models from complex data. In this brief survey, we first introduce one commonly-used dataset for insider threat detection and review the recent literature about deep learning for such research. The existing studies show that compared with traditional machine learning algorithms, deep learning models can improve the performance of insider threat detection. However, applying deep learning to further advance the insider threat detection task still faces several limitations, such as lack of labeled data, adaptive attacks. We then discuss such challenges and suggest future research directions that have the potential to address challenges and further boost the performance of deep learning for insider threat detection.

preprint2016arXiv

A causal framework for discovering and removing direct and indirect discrimination

Anti-discrimination is an increasingly important task in data science. In this paper, we investigate the problem of discovering both direct and indirect discrimination from the historical data, and removing the discriminatory effects before the data is used for predictive analysis (e.g., building classifiers). We make use of the causal network to capture the causal structure of the data. Then we model direct and indirect discrimination as the path-specific effects, which explicitly distinguish the two types of discrimination as the causal effects transmitted along different paths in the network. Based on that, we propose an effective algorithm for discovering direct and indirect discrimination, as well as an algorithm for precisely removing both types of discrimination while retaining good data utility. Different from previous works, our approaches can ensure that the predictive models built from the modified data will not incur discrimination in decision making. Experiments using real datasets show the effectiveness of our approaches.

preprint2016arXiv

Achieving non-discrimination in data release

Discrimination discovery and prevention/removal are increasingly important tasks in data mining. Discrimination discovery aims to unveil discriminatory practices on the protected attribute (e.g., gender) by analyzing the dataset of historical decision records, and discrimination prevention aims to remove discrimination by modifying the biased data before conducting predictive analysis. In this paper, we show that the key to discrimination discovery and prevention is to find the meaningful partitions that can be used to provide quantitative evidences for the judgment of discrimination. With the support of the causal graph, we present a graphical condition for identifying a meaningful partition. Based on that, we develop a simple criterion for the claim of non-discrimination, and propose discrimination removal algorithms which accurately remove discrimination while retaining good data utility. Experiments using real datasets show the effectiveness of our approaches.

preprint2016arXiv

On Spectral Analysis of Directed Signed Graphs

It has been shown that the adjacency eigenspace of a network contains key information of its underlying structure. However, there has been no study on spectral analysis of the adjacency matrices of directed signed graphs. In this paper, we derive theoretical approximations of spectral projections from such directed signed networks using matrix perturbation theory. We use the derived theoretical results to study the influences of negative intra cluster and inter cluster directed edges on node spectral projections. We then develop a spectral clustering based graph partition algorithm, SC-DSG, and conduct evaluations on both synthetic and real datasets. Both theoretical analysis and empirical evaluation demonstrate the effectiveness of the proposed algorithm.

preprint2012arXiv

Approximate Inverse Frequent Itemset Mining: Privacy, Complexity, and Approximation

In order to generate synthetic basket data sets for better benchmark testing, it is important to integrate characteristics from real-life databases into the synthetic basket data sets. The characteristics that could be used for this purpose include the frequent itemsets and association rules. The problem of generating synthetic basket data sets from frequent itemsets is generally referred to as inverse frequent itemset mining. In this paper, we show that the problem of approximate inverse frequent itemset mining is {\bf NP}-complete. Then we propose and analyze an approximate algorithm for approximate inverse frequent itemset mining, and discuss privacy issues related to the synthetic basket data set. In particular, we propose an approximate algorithm to determine the privacy leakage in a synthetic basket data set.

Xintao Wu

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Histopathology-centered Computational Evolution of Spatial Omics: Integration, Mapping, and Foundation Models

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

InfoFair: Information-Theoretic Intersectional Fairness

Adaptive Fairness-Aware Online Meta-Learning for Changing Environments

The Fairness Field Guide: Perspectives from Social and Formal Sciences

Trustworthy Anomaly Detection: A Survey

LogBERT: Log Anomaly Detection via BERT

Deep Learning for Insider Threat Detection: Review, Challenges and Opportunities

A causal framework for discovering and removing direct and indirect discrimination

Achieving non-discrimination in data release

On Spectral Analysis of Directed Signed Graphs

Approximate Inverse Frequent Itemset Mining: Privacy, Complexity, and Approximation