Researcher profile

Xiaolei Huang

Xiaolei Huang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
18works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

18 published item(s)

preprint2026arXiv

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.

preprint2022arXiv

A Gumbel-based Rating Prediction Framework for Imbalanced Recommendation

Rating prediction is a core problem in recommender systems to quantify user's preferences towards items, however, rating imbalance naturally roots in real-world user ratings that cause biased predictions and lead to poor performance on tail ratings. While existing approaches in the rating prediction task deploy weighted cross-entropy to re-weight training samples, such approaches commonly assume an normal distribution, a symmetrical and balanced space. In contrast to the normal assumption, we propose a novel \underline{\emph{G}}umbel-based \underline{\emph{V}}ariational \underline{\emph{N}}etwork framework (GVN) to model rating imbalance and augment feature representations by the Gumbel distributions. We propose a Gumbel-based variational encoder to transform features into non-normal vector space. Second, we deploy a multi-scale convolutional fusion network to integrate comprehensive views of users and items from the rating matrix and user reviews. Third, we adopt a skip connection module to personalize final rating predictions. We conduct extensive experiments on five datasets with both error- and ranking-based metrics. Experiments on ranking and regression evaluation tasks prove that the GVN can effectively achieve state-of-the-art performance across the datasets and reduce the biased predictions of tail ratings. We compare with various distributions (e.g., normal and Poisson) and demonstrate the effectiveness of Gumbel-based methods on class-imbalance modeling.

preprint2022arXiv

Asymmetry Disentanglement Network for Interpretable Acute Ischemic Stroke Infarct Segmentation in Non-Contrast CT Scans

Accurate infarct segmentation in non-contrast CT (NCCT) images is a crucial step toward computer-aided acute ischemic stroke (AIS) assessment. In clinical practice, bilateral symmetric comparison of brain hemispheres is usually used to locate pathological abnormalities. Recent research has explored asymmetries to assist with AIS segmentation. However, most previous symmetry-based work mixed different types of asymmetries when evaluating their contribution to AIS. In this paper, we propose a novel Asymmetry Disentanglement Network (ADN) to automatically separate pathological asymmetries and intrinsic anatomical asymmetries in NCCTs for more effective and interpretable AIS segmentation. ADN first performs asymmetry disentanglement based on input NCCTs, which produces different types of 3D asymmetry maps. Then a synthetic, intrinsic-asymmetry-compensated and pathology-asymmetry-salient NCCT volume is generated and later used as input to a segmentation network. The training of ADN incorporates domain knowledge and adopts a tissue-type aware regularization loss function to encourage clinically-meaningful pathological asymmetry extraction. Coupled with an unsupervised 3D transformation network, ADN achieves state-of-the-art AIS segmentation performance on a public NCCT dataset. In addition to the superior performance, we believe the learned clinically-interpretable asymmetry maps can also provide insights towards a better understanding of AIS assessment. Our code is available at https://github.com/nihaomiao/MICCAI22_ADN.

preprint2022arXiv

Deep Depth from Focus with Differential Focus Volume

Depth-from-focus (DFF) is a technique that infers depth using the focus change of a camera. In this work, we propose a convolutional neural network (CNN) to find the best-focused pixels in a focal stack and infer depth from the focus estimation. The key innovation of the network is the novel deep differential focus volume (DFV). By computing the first-order derivative with the stacked features over different focal distances, DFV is able to capture both the focus and context information for focus analysis. Besides, we also introduce a probability regression mechanism for focus estimation to handle sparsely sampled focal stacks and provide uncertainty estimation to the final prediction. Comprehensive experiments demonstrate that the proposed model achieves state-of-the-art performance on multiple datasets with good generalizability and fast speed.

preprint2022arXiv

DeepStroke: An Efficient Stroke Screening Framework for Emergency Rooms with Multimodal Adversarial Deep Learning

In an emergency room (ER) setting, stroke triage or screening is a common challenge. A quick CT is usually done instead of MRI due to MRI's slow throughput and high cost. Clinical tests are commonly referred to during the process, but the misdiagnosis rate remains high. We propose a novel multimodal deep learning framework, DeepStroke, to achieve computer-aided stroke presence assessment by recognizing patterns of minor facial muscles incoordination and speech inability for patients with suspicion of stroke in an acute setting. Our proposed DeepStroke takes one-minute facial video data and audio data readily available during stroke triage for local facial paralysis detection and global speech disorder analysis. Transfer learning was adopted to reduce face-attribute biases and improve generalizability. We leverage a multi-modal lateral fusion to combine the low- and high-level features and provide mutual regularization for joint training. Novel adversarial training is introduced to obtain identity-free and stroke-discriminative features. Experiments on our video-audio dataset with actual ER patients show that DeepStroke outperforms state-of-the-art models and achieves better performance than both a triage team and ER doctors, attaining a 10.94% higher sensitivity and maintaining 7.37% higher accuracy than traditional stroke triage when specificity is aligned. Meanwhile, each assessment can be completed in less than six minutes, demonstrating the framework's great potential for clinical translation.

preprint2022arXiv

Easy Adaptation to Mitigate Gender Bias in Multilingual Text Classification

Existing approaches to mitigate demographic biases evaluate on monolingual data, however, multilingual data has not been examined. In this work, we treat the gender as domains (e.g., male vs. female) and present a standard domain adaptation model to reduce the gender bias and improve performance of text classifiers under multilingual settings. We evaluate our approach on two text classification tasks, hate speech detection and rating prediction, and demonstrate the effectiveness of our approach with three fair-aware baselines.

preprint2022arXiv

End-to-end Graph-constrained Vectorized Floorplan Generation with Panoptic Refinement

The automatic generation of floorplans given user inputs has great potential in architectural design and has recently been explored in the computer vision community. However, the majority of existing methods synthesize floorplans in the format of rasterized images, which are difficult to edit or customize. In this paper, we aim to synthesize floorplans as sequences of 1-D vectors, which eases user interaction and design customization. To generate high fidelity vectorized floorplans, we propose a novel two-stage framework, including a draft stage and a multi-round refining stage. In the first stage, we encode the room connectivity graph input by users with a graph convolutional network (GCN), then apply an autoregressive transformer network to generate an initial floorplan sequence. To polish the initial design and generate more visually appealing floorplans, we further propose a novel panoptic refinement network(PRN) composed of a GCN and a transformer network. The PRN takes the initial generated sequence as input and refines the floorplan design while encouraging the correct room connectivity with our proposed geometric loss. We have conducted extensive experiments on a real-world floorplan dataset, and the results show that our method achieves state-of-the-art performance under different settings and evaluation metrics.

preprint2022arXiv

Enriching Unsupervised User Embedding via Medical Concepts

Clinical notes in Electronic Health Records (EHR) present rich documented information of patients to inference phenotype for disease diagnosis and study patient characteristics for cohort selection. Unsupervised user embedding aims to encode patients into fixed-length vectors without human supervisions. Medical concepts extracted from the clinical notes contain rich connections between patients and their clinical categories. However, existing unsupervised approaches of user embeddings from clinical notes do not explicitly incorporate medical concepts. In this study, we propose a concept-aware unsupervised user embedding that jointly leverages text documents and medical concepts from two clinical corpora, MIMIC-III and Diabetes. We evaluate user embeddings on both extrinsic and intrinsic tasks, including phenotype classification, in-hospital mortality prediction, patient retrieval, and patient relatedness. Experiments on the two clinical corpora show our approach exceeds unsupervised baselines, and incorporating medical concepts can significantly improve the baseline performance.

preprint2022arXiv

Experimental Augmented Reality User Experience

Augmented Reality (AR) is an emerging field ripe for experimentation, especially when it comes to developing the kinds of applications and experiences that will drive mass adoption of the technology. While we aren't aware of any current consumer product that realize a wearable, wide Field of View (FoV), AR Head Mounted Display (HMD), such devices will certainly come. In order for these sophisticated, likely high-cost hardware products to succeed, it is important they provide a high quality user experience. To that end, we prototyped 4 experimental applications for wide FoV displays that will likely exist in the future. Given current AR HMD limitations, we used a AR simulator built on web technology and VR headsets to demonstrate these applications, allowing users and designers to peer into the future.

preprint2022arXiv

Learning to Adapt Domain Shifts of Moral Values via Instance Weighting

Classifying moral values in user-generated text from social media is critical in understanding community cultures and interpreting user behaviors of social movements. Moral values and language usage can change across the social movements; however, text classifiers are usually trained in source domains of existing social movements and tested in target domains of new social issues without considering the variations. In this study, we examine domain shifts of moral values and language usage, quantify the effects of domain shifts on the morality classification task, and propose a neural adaptation framework via instance weighting to improve cross-domain classification tasks. The quantification analysis suggests a strong correlation between morality shifts, language usage, and classification performance. We evaluate the neural adaptation framework on a public Twitter data across 7 social movements and gain classification improvements up to 12.1\%. Finally, we release a new data of the COVID-19 vaccine labeled with moral values and evaluate our approach on the new target domain. For the case study of the COVID-19 vaccine, our adaptation framework achieves up to 5.26\% improvements over neural baselines.

preprint2022arXiv

Unsupervised Reinforcement Adaptation for Class-Imbalanced Text Classification

Class imbalance naturally exists when train and test models in different domains. Unsupervised domain adaptation (UDA) augments model performance with only accessible annotations from the source domain and unlabeled data from the target domain. However, existing state-of-the-art UDA models learn domain-invariant representations and evaluate primarily on class-balanced data across domains. In this work, we propose an unsupervised domain adaptation approach via reinforcement learning that jointly leverages feature variants and imbalanced labels across domains. We experiment with the text classification task for its easily accessible datasets and compare the proposed method with five baselines. Experiments on three datasets prove that our proposed method can effectively learn robust domain-invariant representations and successfully adapt text classifiers on imbalanced classes over domains. The code is available at https://github.com/woqingdoua/ImbalanceClass.

preprint2021arXiv

User Factor Adaptation for User Embedding via Multitask Learning

Language varies across users and their interested fields in social media data: words authored by a user across his/her interests may have different meanings (e.g., cool) or sentiments (e.g., fast). However, most of the existing methods to train user embeddings ignore the variations across user interests, such as product and movie categories (e.g., drama vs. action). In this study, we treat the user interest as domains and empirically examine how the user language can vary across the user factor in three English social media datasets. We then propose a user embedding model to account for the language variability of user interests via a multitask learning framework. The model learns user language and its variations without human supervision. While existing work mainly evaluated the user embedding by extrinsic tasks, we propose an intrinsic evaluation via clustering and evaluate user embeddings by an extrinsic task, text classification. The experiments on the three English-language social media datasets show that our proposed approach can generally outperform baselines via adapting the user factor.

preprint2020arXiv

Multilingual Twitter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition

Existing research on fairness evaluation of document classification models mainly uses synthetic monolingual data without ground truth for author demographic attributes. In this work, we assemble and publish a multilingual Twitter corpus for the task of hate speech detection with inferred four author demographic factors: age, country, gender and race/ethnicity. The corpus covers five languages: English, Italian, Polish, Portuguese and Spanish. We evaluate the inferred demographic labels with a crowdsourcing platform, Figure Eight. To examine factors that can cause biases, we take an empirical analysis of demographic predictability on the English corpus. We measure the performance of four popular document classifiers and evaluate the fairness and bias of the baseline classifiers on the author-level demographic attributes.

preprint2020arXiv

Neural Wireframe Renderer: Learning Wireframe to Image Translations

In architecture and computer-aided design, wireframes (i.e., line-based models) are widely used as basic 3D models for design evaluation and fast design iterations. However, unlike a full design file, a wireframe model lacks critical information, such as detailed shape, texture, and materials, needed by a conventional renderer to produce 2D renderings of the objects or scenes. In this paper, we bridge the information gap by generating photo-realistic rendering of indoor scenes from wireframe models in an image translation framework. While existing image synthesis methods can generate visually pleasing images for common objects such as faces and birds, these methods do not explicitly model and preserve essential structural constraints in a wireframe model, such as junctions, parallel lines, and planar surfaces. To this end, we propose a novel model based on a structure-appearance joint representation learned from both images and wireframes. In our model, structural constraints are explicitly enforced by learning a joint representation in a shared encoder network that must support the generation of both images and wireframes. Experiments on a wireframe-scene dataset show that our wireframe-to-image translation model significantly outperforms the state-of-the-art methods in both visual quality and structural integrity of generated images.

preprint2020arXiv

Semi-Supervised Cervical Dysplasia Classification With Learnable Graph Convolutional Network

Cervical cancer is the second most prevalent cancer affecting women today. As the early detection of cervical carcinoma relies heavily upon screening and pre-clinical testing, digital cervicography has great potential as a primary or auxiliary screening tool, especially in low-resource regions due to its low cost and easy access. Although an automated cervical dysplasia detection system has been desirable, traditional fully-supervised training of such systems requires large amounts of annotated data which are often labor-intensive to collect. To alleviate the need for much manual annotation, we propose a novel graph convolutional network (GCN) based semi-supervised classification model that can be trained with fewer annotations. In existing GCNs, graphs are constructed with fixed features and can not be updated during the learning process. This limits their ability to exploit new features learned during graph convolution. In this paper, we propose a novel and more flexible GCN model with a feature encoder that adaptively updates the adjacency matrix during learning and demonstrate that this model design leads to improved performance. Our experimental results on a cervical dysplasia classification dataset show that the proposed framework outperforms previous methods under a semi-supervised setting, especially when the labeled samples are scarce.

preprint2020arXiv

SiamParseNet: Joint Body Parsing and Label Propagation in Infant Movement Videos

General movement assessment (GMA) of infant movement videos (IMVs) is an effective method for the early detection of cerebral palsy (CP) in infants. Automated body parsing is a crucial step towards computer-aided GMA, in which infant body parts are segmented and tracked over time for movement analysis. However, acquiring fully annotated data for video-based body parsing is particularly expensive due to the large number of frames in IMVs. In this paper, we propose a semi-supervised body parsing model, termed SiamParseNet (SPN), to jointly learn single frame body parsing and label propagation between frames in a semi-supervised fashion. The Siamese-structured SPN consists of a shared feature encoder, followed by two separate branches: one for intra-frame body parts segmentation, and one for inter-frame label propagation. The two branches are trained jointly, taking pairs of frames from the same videos as their input. An adaptive training process is proposed that alternates training modes between using input pairs of only labeled frames and using inputs of both labeled and unlabeled frames. During testing, we employ a multi-source inference mechanism, where the final result for a test frame is either obtained via the segmentation branch or via propagation from a nearby key frame. We conduct extensive experiments on a partially-labeled IMV dataset where SPN outperforms all prior arts, demonstrating the effectiveness of our proposed method.

preprint2020arXiv

Synthetic Sample Selection via Reinforcement Learning

Synthesizing realistic medical images provides a feasible solution to the shortage of training data in deep learning based medical image recognition systems. However, the quality control of synthetic images for data augmentation purposes is under-investigated, and some of the generated images are not realistic and may contain misleading features that distort data distribution when mixed with real images. Thus, the effectiveness of those synthetic images in medical image recognition systems cannot be guaranteed when they are being added randomly without quality assurance. In this work, we propose a reinforcement learning (RL) based synthetic sample selection method that learns to choose synthetic images containing reliable and informative features. A transformer based controller is trained via proximal policy optimization (PPO) using the validation classification accuracy as the reward. The selected images are mixed with the original training data for improved training of image recognition systems. To validate our method, we take the pathology image recognition as an example and conduct extensive experiments on two histopathology image datasets. In experiments on a cervical dataset and a lymph node dataset, the image classification performance is improved by 8.1% and 2.3%, respectively, when utilizing high-quality synthetic images selected by our RL framework. Our proposed synthetic sample selection method is general and has great potential to boost the performance of various medical image recognition systems given limited annotation.

preprint2018arXiv

Computer-Aided Diagnosis of Label-Free 3-D Optical Coherence Microscopy Images of Human Cervical Tissue

Objective: Ultrahigh-resolution optical coherence microscopy (OCM) has recently demonstrated its potential for accurate diagnosis of human cervical diseases. One major challenge for clinical adoption, however, is the steep learning curve clinicians need to overcome to interpret OCM images. Developing an intelligent technique for computer-aided diagnosis (CADx) to accurately interpret OCM images will facilitate clinical adoption of the technology and improve patient care. Methods: 497 high-resolution 3-D OCM volumes (600 cross-sectional images each) were collected from 159 ex vivo specimens of 92 female patients. OCM image features were extracted using a convolutional neural network (CNN) model, concatenated with patient information (e.g., age, HPV results), and classified using a support vector machine classifier. Ten-fold cross-validations were utilized to test the performance of the CADx method in a five-class classification task and a binary classification task. Results: An 88.3 plus or minus 4.9% classification accuracy was achieved for five fine-grained classes of cervical tissue, namely normal, ectropion, low-grade and high-grade squamous intraepithelial lesions (LSIL and HSIL), and cancer. In the binary classification task (low-risk [normal, ectropion and LSIL] vs. high-risk [HSIL and cancer]), the CADx method achieved an area-under-the-curve (AUC) value of 0.959 with an 86.7 plus or minus 11.4% sensitivity and 93.5 plus or minus 3.8% specificity. Conclusion: The proposed deep-learning based CADx method outperformed three human experts. It was also able to identify morphological characteristics in OCM images that were consistent with histopathological interpretations. Significance: Label-free OCM imaging, combined with deep-learning based CADx methods, hold a great promise to be used in clinical settings for the effective screening and diagnosis of cervical diseases.