Researcher profile

Lu Cheng

Lu Cheng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
15works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

15 published item(s)

preprint2026arXiv

MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series

Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization under appropriate assumptions. However, identifiability does not imply interpretability: latent semantics are typically assigned post hoc by alignment with known ground-truth factors. This limitation is particularly acute in scientific time series, where underlying mechanisms are unknown and discovering interpretable structure is a primary goal. In contrast, scientific observations (such as residue-pair distances, climate indices, or process sensors) are inherently semantic, as they correspond to named physical quantities. This raises a key question: can the interpretability of observations be transferred to the identifiable latent space? We propose MOSAIC (Module discovery via Sparse Additive Identifiable Causal learning), a sparse temporal VAE that integrates temporal CRL identifiability with support recovery over observed variables. MOSAIC identifies latent variables via regime-conditioned temporal variation, and recovers for each latent a sparse set of associated observations through an additive decoder, yielding module-level interpretability. We show that ANOVA main-effect supports are identifiable under general smooth mixing functions, and provide finite-sample recovery guarantees for a tractable sparse-additive variant. Empirically, MOSAIC recovers domain-consistent variable groups across RNA molecular dynamics, solar wind, ENSO climate, the Tennessee Eastman process, and a synthetic tokamak benchmark, enabling interpretable discovery of latent mechanisms in scientific time series.

preprint2022arXiv

Bridging the Gap: Commonality and Differences between Online and Offline COVID-19 Data

With the onset of the COVID-19 pandemic, news outlets and social media have become central tools for disseminating and consuming information. Because of their ease of access, users seek COVID-19-related information from online social media (i.e., online news) and news outlets (i.e., offline news). Online and offline news are often connected, sharing common topics while each has unique, different topics. A gap between these two news sources can lead to misinformation propagation. For instance, according to the Guardian, most COVID-19 misinformation comes from users on social media. Without fact-checking social media news, misinformation can lead to health threats. In this paper, we focus on the novel problem of bridging the gap between online and offline data by monitoring their common and distinct topics generated over time. We employ Twitter (online) and local news (offline) data for a time span of two years. Using online matrix factorization, we analyze and study online and offline COVID-19-related data differences and commonalities. We design experiments to show how online and offline data are linked together and what trends they follow.

preprint2022arXiv

Causal Disentanglement with Network Information for Debiased Recommendations

Recommender systems aim to recommend new items to users by learning user and item representations. In practice, these representations are highly entangled as they consist of information about multiple factors, including user's interests, item attributes along with confounding factors such as user conformity, and item popularity. Considering these entangled representations for inferring user preference may lead to biased recommendations (e.g., when the recommender model recommends popular items even if they do not align with the user's interests). Recent research proposes to debias by modeling a recommender system from a causal perspective. The exposure and the ratings are analogous to the treatment and the outcome in the causal inference framework, respectively. The critical challenge in this setting is accounting for the hidden confounders. These confounders are unobserved, making it hard to measure them. On the other hand, since these confounders affect both the exposure and the ratings, it is essential to account for them in generating debiased recommendations. To better approximate hidden confounders, we propose to leverage network information (i.e., user-social and user-item networks), which are shown to influence how users discover and interact with an item. Aside from the user conformity, aspects of confounding such as item popularity present in the network information is also captured in our method with the aid of \textit{causal disentanglement} which unravels the learned representations into independent factors that are responsible for (a) modeling the exposure of an item to the user, (b) predicting the ratings, and (c) controlling the hidden confounders. Experiments on real-world datasets validate the effectiveness of the proposed model for debiasing recommender systems.

preprint2022arXiv

Causal Mediation Analysis with Hidden Confounders

An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability that implies all confounders can be measured, which is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumptions and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potentials of our approach for causal fairness analysis.

preprint2022arXiv

Debiasing Word Embeddings with Nonlinear Geometry

Debiasing word embeddings has been largely limited to individual and independent social categories. However, real-world corpora typically present multiple social categories that possibly correlate or intersect with each other. For instance, "hair weaves" is stereotypically associated with African American females, but neither African American nor females alone. Therefore, this work studies biases associated with multiple social categories: joint biases induced by the union of different categories and intersectional biases that do not overlap with the biases of the constituent categories. We first empirically observe that individual biases intersect non-trivially (i.e., over a one-dimensional subspace). Drawing from the intersectional theory in social science and the linguistic theory, we then construct an intersectional subspace to debias for multiple social categories using the nonlinear geometry of individual biases. Empirical evaluations corroborate the efficacy of our approach. Data and implementation code can be downloaded at https://github.com/GitHubLuCheng/Implementation-of-JoSEC-COLING-22.

preprint2022arXiv

Effects of Multi-Aspect Online Reviews with Unobserved Confounders: Estimation and Implication

Online review systems are the primary means through which many businesses seek to build the brand and spread their messages. Prior research studying the effects of online reviews has been mainly focused on a single numerical cause, e.g., ratings or sentiment scores. We argue that such notions of causes entail three key limitations: they solely consider the effects of single numerical causes and ignore different effects of multiple aspects -- e.g., Food, Service -- embedded in the textual reviews; they assume the absence of hidden confounders in observational studies, e.g., consumers' personal preferences; and they overlook the indirect effects of numerical causes that can potentially cancel out the effect of textual reviews on business revenue. We thereby propose an alternative perspective to this single-cause-based effect estimation of online reviews: in the presence of hidden confounders, we consider multi-aspect textual reviews, particularly, their total effects on business revenue and direct effects with the numerical cause -- ratings -- being the mediator. We draw on recent advances in machine learning and causal inference to together estimate the hidden confounders and causal effects. We present empirical evaluations using real-world examples to discuss the importance and implications of differentiating the multi-aspect effects in strategizing business operations.

preprint2022arXiv

Estimating Causal Effects of Multi-Aspect Online Reviews with Multi-Modal Proxies

Online reviews enable consumers to engage with companies and provide important feedback. Due to the complexity of the high-dimensional text, these reviews are often simplified as a single numerical score, e.g., ratings or sentiment scores. This work empirically examines the causal effects of user-generated online reviews on a granular level: we consider multiple aspects, e.g., the Food and Service of a restaurant. Understanding consumers' opinions toward different aspects can help evaluate business performance in detail and strategize business operations effectively. Specifically, we aim to answer interventional questions such as What will the restaurant popularity be if the quality w.r.t. its aspect Service is increased by 10%? The defining challenge of causal inference with observational data is the presence of "confounder", which might not be observed or measured, e.g., consumers' preference to food type, rendering the estimated effects biased and high-variance. To address this challenge, we have recourse to the multi-modal proxies such as the consumer profile information and interactions between consumers and businesses. We show how to effectively leverage the rich information to identify and estimate causal effects of multiple aspects embedded in online reviews. Empirical evaluations on synthetic and real-world data corroborate the efficacy and shed light on the actionable insight of the proposed approach.

preprint2022arXiv

Evaluation Methods and Measures for Causal Learning Algorithms

The convenient access to copious multi-faceted data has encouraged machine learning researchers to reconsider correlation-based learning and embrace the opportunity of causality-based learning, i.e., causal machine learning (causal learning). Recent years have therefore witnessed great effort in developing causal learning algorithms aiming to help AI achieve human-level intelligence. Due to the lack-of ground-truth data, one of the biggest challenges in current causal learning research is algorithm evaluations. This largely impedes the cross-pollination of AI and causal inference, and hinders the two fields to benefit from the advances of the other. To bridge from conventional causal inference (i.e., based on statistical methods) to causal learning with big data (i.e., the intersection of causal inference and machine learning), in this survey, we review commonly-used datasets, evaluation methods, and measures for causal learning using an evaluation pipeline similar to conventional machine learning. We focus on the two fundamental causal-inference tasks and causality-aware machine learning tasks. Limitations of current evaluation procedures are also discussed. We then examine popular causal inference tools/packages and conclude with primary challenges and opportunities for benchmarking causal learning algorithms in the era of big data. The survey seeks to bring to the forefront the urgency of developing publicly available benchmarks and consensus-building standards for causal learning evaluation with observational data. In doing so, we hope to broaden the discussions and facilitate collaboration to advance the innovation and application of causal learning.

preprint2022arXiv

Human Instance Segmentation and Tracking via Data Association and Single-stage Detector

Human video instance segmentation plays an important role in computer understanding of human activities and is widely used in video processing, video surveillance, and human modeling in virtual reality. Most current VIS methods are based on Mask-RCNN framework, where the target appearance and motion information for data matching will increase computational cost and have an impact on segmentation real-time performance; on the other hand, the existing datasets for VIS focus less on all the people appearing in the video. In this paper, to solve the problems, we develop a new method for human video instance segmentation based on single-stage detector. To tracking the instance across the video, we have adopted data association strategy for matching the same instance in the video sequence, where we jointly learn target instance appearances and their affinities in a pair of video frames in an end-to-end fashion. We have also adopted the centroid sampling strategy for enhancing the embedding extraction ability of instance, which is to bias the instance position to the inside of each instance mask with heavy overlap condition. As a result, even there exists a sudden change in the character activity, the instance position will not move out of the mask, so that the problem that the same instance is represented by two different instances can be alleviated. Finally, we collect PVIS dataset by assembling several video instance segmentation datasets to fill the gap of the current lack of datasets dedicated to human video segmentation. Extensive simulations based on such dataset has been conduct. Simulation results verify the effectiveness and efficiency of the proposed work.

preprint2022arXiv

Improving Vaccine Stance Detection by Combining Online and Offline Data

Differing opinions about COVID-19 have led to various online discourses regarding vaccines. Due to the detrimental effects and the scale of the COVID-19 pandemic, detecting vaccine stance has become especially important and is attracting increasing attention. Communication during the pandemic is typically done via online and offline sources, which provide two complementary avenues for detecting vaccine stance. Therefore, this paper aims to (1) study the importance of integrating online and offline data to vaccine stance detection; and (2) identify the critical online and offline attributes that influence an individual's vaccine stance. We model vaccine hesitancy as a surrogate for identifying the importance of online and offline factors. With the aid of explainable AI and combinatorial analysis, we conclude that both online and offline factors help predict vaccine stance.

preprint2022arXiv

On Block Accelerations of Quantile Randomized Kaczmarz for Corrupted Systems of Linear Equations

With the growth of large data as well as large-scale learning tasks, the need for efficient and robust linear system solvers is greater than ever. The randomized Kaczmarz method (RK) and similar stochastic iterative methods have received considerable recent attention due to their efficient implementation and memory footprint. These methods can tolerate streaming data, accessing only part of the data at a time, and can also approximate the least squares solution even if the system is affected by noise. However, when data is instead affected by large (possibly adversarial) corruptions, these methods fail to converge, as corrupted data points draw iterates far from the true solution. A recently proposed solution to this is the QuantileRK method, which avoids harmful corrupted data by exploring the space carefully as the method iterates. The exploration component requires the computation of quantiles of large samples from the system and is computationally much heavier than the subsequent iteration update. In this paper, we propose an approach that better uses the information obtained during exploration by incorporating an averaged version of the block Kaczmarz method. This significantly speeds up convergence, while still allowing for a constant fraction of the equations to be arbitrarily corrupted. We provide theoretical convergence guarantees as well as experimental supporting evidence. We also demonstrate that the classical projection-based block Kaczmarz method cannot be robust to sparse adversarial corruptions, but rather the blocking has to be carried out by averaging one-dimensional projections.

preprint2022arXiv

Toward Understanding Bias Correlations for Mitigation in NLP

Natural Language Processing (NLP) models have been found discriminative against groups of different social identities such as gender and race. With the negative consequences of these undesired biases, researchers have responded with unprecedented effort and proposed promising approaches for bias mitigation. In spite of considerable practical importance, current algorithmic fairness literature lacks an in-depth understanding of the relations between different forms of biases. Social bias is complex by nature. Numerous studies in social psychology identify the "generalized prejudice", i.e., generalized devaluing sentiments across different groups. For example, people who devalue ethnic minorities are also likely to devalue women and gays. Therefore, this work aims to provide a first systematic study toward understanding bias correlations in mitigation. In particular, we examine bias mitigation in two common NLP tasks -- toxicity detection and word embeddings -- on three social identities, i.e., race, gender, and religion. Our findings suggest that biases are correlated and present scenarios in which independent debiasing approaches dominant in current literature may be insufficient. We further investigate whether jointly mitigating correlated biases is more desired than independent and individual debiasing. Lastly, we shed light on the inherent issue of debiasing-accuracy trade-off in bias mitigation. This study serves to motivate future research on joint bias mitigation that accounts for correlated biases.

preprint2021arXiv

Improving Cyberbully Detection with User Interaction

Cyberbullying, identified as intended and repeated online bullying behavior, has become increasingly prevalent in the past few decades. Despite the significant progress made thus far, the focus of most existing work on cyberbullying detection lies in the independent content analysis of different comments within a social media session. We argue that such leading notions of analysis suffer from three key limitations: they overlook the temporal correlations among different comments; they only consider the content within a single comment rather than the topic coherence across comments; they remain generic and exploit limited interactions between social media users. In this work, we observe that user comments in the same session may be inherently related, e.g., discussing similar topics, and their interaction may evolve over time. We also show that modeling such topic coherence and temporal interaction are critical to capture the repetitive characteristics of bullying behavior, thus leading to better predicting performance. To achieve the goal, we first construct a unified temporal graph for each social media session. Drawing on recent advances in graph neural network, we then propose a principled graph-based approach for modeling the temporal dynamics and topic coherence throughout user interactions. We empirically evaluate the effectiveness of our approach with the tasks of session-level bullying detection and comment-level case study. Our code is released to public.

preprint2020arXiv

A Survey of Learning Causality with Data: Problems and Methods

This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from -- or the same as -- the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.

preprint2020arXiv

Unsupervised Cyberbullying Detection via Time-Informed Gaussian Mixture Model

Social media is a vital means for information-sharing due to its easy access, low cost, and fast dissemination characteristics. However, increases in social media usage have corresponded with a rise in the prevalence of cyberbullying. Most existing cyberbullying detection methods are supervised and, thus, have two key drawbacks: (1) The data labeling process is often time-consuming and labor-intensive; (2) Current labeling guidelines may not be generalized to future instances because of different language usage and evolving social networks. To address these limitations, this work introduces a principled approach for unsupervised cyberbullying detection. The proposed model consists of two main components: (1) A representation learning network that encodes the social media session by exploiting multi-modal features, e.g., text, network, and time. (2) A multi-task learning network that simultaneously fits the comment inter-arrival times and estimates the bullying likelihood based on a Gaussian Mixture Model. The proposed model jointly optimizes the parameters of both components to overcome the shortcomings of decoupled training. Our core contribution is an unsupervised cyberbullying detection model that not only experimentally outperforms the state-of-the-art unsupervised models, but also achieves competitive performance compared to supervised models.