Researcher profile

Jing Gao

Jing Gao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
18works
0followers
14topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

18 published item(s)

preprint2026arXiv

Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P<0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.

preprint2025arXiv

Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking

Split Federated Learning (SFL) has emerged as an efficient alternative to traditional Federated Learning (FL) by reducing client-side computation through model partitioning. However, exchanging of intermediate activations and model updates introduces significant privacy risks, especially from data reconstruction attacks that recover original inputs from intermediate representations. Existing defenses using noise injection often degrade model performance. To overcome these challenges, we present PM-SFL, a scalable and privacy-preserving SFL framework that incorporates Probabilistic Mask training to add structured randomness without relying on explicit noise. This mitigates data reconstruction risks while maintaining model utility. To address data heterogeneity, PM-SFL employs personalized mask learning that tailors submodel structures to each client&#39;s local data. For system heterogeneity, we introduce a layer-wise knowledge compensation mechanism, enabling clients with varying resources to participate effectively under adaptive model splitting. Theoretical analysis confirms its privacy protection, and experiments on image and wireless sensing tasks demonstrate that PM-SFL consistently improves accuracy, communication efficiency, and robustness to privacy attacks, with particularly strong performance under data and system heterogeneity.

preprint2024arXiv

Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

In the evolving field of corporate sustainability, analyzing unstructured Environmental, Social, and Governance (ESG) reports is a complex challenge due to their varied formats and intricate content. This study introduces an innovative methodology utilizing the &#34;Unstructured Core Library&#34;, specifically tailored to address these challenges by transforming ESG reports into structured, analyzable formats. Our approach significantly advances the existing research by offering high-precision text cleaning, adept identification and extraction of text from images, and standardization of tables within these reports. Emphasizing its capability to handle diverse data types, including text, images, and tables, the method adeptly manages the nuances of differing page layouts and report styles across industries. This research marks a substantial contribution to the fields of industrial ecology and corporate sustainability assessment, paving the way for the application of advanced NLP technologies and large language models in the analysis of corporate governance and sustainability. Our code is available at https://github.com/linancn/TianGong-AI-Unstructure.git.

preprint2023arXiv

A portable sub Hertz ultra-stable laser over 1700km highway transportation

We present a subHz linewidth portable ultrastable laser with the mass and volume of are 40kg and 400mm*280mm*450mm, respectively, that meets the requirements of automatic frequency locking and road transportation. A dynamic analytical model of the physical parts of ultrastable laser is established, and the first order resonance frequency is determined by FEA and well agrees with the experimentally measured result. To verify the transport performance of the portable ultrastable laser, it is tested for 100 km actual road transportation and 60 min continuous vibration, corresponding to 1700 km road transportation. The success of the test demonstrated that the portable ultrastable laser was very robust. Meanwhile, the portable ultrastable lasers shows that the median of the linewidth distribution is approximately 0.78 Hz, and the fractional frequency instability is less than 3E-15 at 1 to 10 s averaging time. This value approaches the total noise of 2.0E-15 including thermal noise and residual amplitude modulation. The robust suggested that the portable ultrastable laser might be a good candidate such as optical frequency transfer and metrological systems.

preprint2022arXiv

$B\to D \ell ν_\ell$ form factors beyond leading power and extraction of $|V_{cb}|$ and $R(D)$

We investigate the subleading-power corrections to the exclusive $B\to D \ell ν_\ell$ form factors at ${\cal O} (α_s^0)$ in the light-cone sum rules (LCSR) framework by including the two- and three-particle higher-twist contributions from the $B$-meson light-cone distribution amplitudes up to the twist-six accuracy, by taking into account the subleading terms in expanding the hard-collinear charm-quark propagator, and by evaluating the hadronic matrix element of the subleading effective current $\bar q \, γ_μ \, i \not\!\!{D}_\perp / (2 \, m_b) \, h_v$. Employing further the available leading-power results for the semileptonic $B \to D$ form factors at the next-to-leading-logarithmic level and combining our improved LCSR predictions with the recent lattice determinations, we then carry out a comprehensive phenomenological analysis on the semi-leptonic $B\to D \ell ν_\ell$ decay. We extract $|V_{cb}| = \big( 40.2^{+0.6}_{-0.5} {\big |_{\rm th}}\,\, {}^{+1.4}_{-1.4} {\big |_{\rm exp}} \big)\times 10^{-3}$ ($|V_{cb}| = \big( 40.9^{+0.6}_{-0.5} {\big |_{\rm th}}\,\, {}^{+1.0}_{-1.0} {\big |_{\rm exp}} \big)\times 10^{-3}$) using the BaBar (Belle) experimental data, and particularly obtain for the gold-plated ratio $R(D) = 0.302\pm 0.003$.

preprint2022arXiv

Label a Herd in Minutes: Individual Holstein-Friesian Cattle Identification

We describe a practically evaluated approach for training visual cattle ID systems for a whole farm requiring only ten minutes of labelling effort. In particular, for the task of automatic identification of individual Holstein-Friesians in real-world farm CCTV, we show that self-supervision, metric learning, cluster analysis, and active learning can complement each other to significantly reduce the annotation requirements usually needed to train cattle identification frameworks. Evaluating the approach on the test portion of the publicly available Cows2021 dataset, for training we use 23,350 frames across 435 single individual tracklets generated by automated oriented cattle detection and tracking in operational farm footage. Self-supervised metric learning is first employed to initialise a candidate identity space where each tracklet is considered a distinct entity. Grouping entities into equivalence classes representing cattle identities is then performed by automated merging via cluster analysis and active learning. Critically, we identify the inflection point at which automated choices cannot replicate improvements based on human intervention to reduce annotation to a minimum. Experimental results show that cluster analysis and a few minutes of labelling after automated self-supervision can improve the test identification accuracy of 153 identities to 92.44% (ARI=0.93) from the 74.9% (ARI=0.754) obtained by self-supervision only. These promising results indicate that a tailored combination of human and machine reasoning in visual cattle ID pipelines can be highly effective whilst requiring only minimal labelling effort. We provide all key source code and network weights with this paper for easy result reproduction.

preprint2022arXiv

LiST: Lite Prompted Self-training Makes Parameter-Efficient Few-shot Learners

We present a new method LiST is short for Lite Prompted Self-Training for parameter-efficient fine-tuning of large pre-trained language models (PLMs) for few-shot learning. LiST improves over recent methods that adopt prompt-based fine-tuning (FN) using two key techniques. The first is the use of self-training to leverage large amounts of unlabeled data for prompt-based FN in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. Self-training is expensive as it requires updating all the model parameters repetitively. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. Our experiments show that LiST can effectively leverage unlabeled data to improve the model performance for few-shot learning. Additionally, the fine-tuning is efficient as it only updates a small percentage of parameters and the overall model footprint is reduced since several tasks can share a common PLM encoder as backbone. A comprehensive study on six NLU tasks demonstrate LiST to improve by 35% over classic fine-tuning and 6% over prompt-based FN with 96% reduction in number of trainable parameters when fine-tuned with no more than 30 labeled examples from each task. With only 14M tunable parameters, LiST outperforms GPT-3 in-context learning by 33% on few-shot NLU tasks.

preprint2022arXiv

Next-to-Next-to-Leading-Order QCD Prediction for the Photon-Pion Form Factor

We accomplish the complete two-loop computation of the leading-twist contribution to the photon-pion transition form factor $γ\, γ^{\ast} \to π^0$ by applying the hard-collinear factorization theorem together with modern multi-loop techniques. The resulting predictions for the form factor indicate that the two-loop perturbative correction is numerically important. We also demonstrate that our results will play a key role in disentangling various models of the twist-two pion distribution amplitude thanks to the envisaged precision at Belle II.

preprint2022arXiv

SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer

Clinically, the accurate annotation of lesions/tissues can significantly facilitate the disease diagnosis. For example, the segmentation of optic disc/cup (OD/OC) on fundus image would facilitate the glaucoma diagnosis, the segmentation of skin lesions on dermoscopic images is helpful to the melanoma diagnosis, etc. With the advancement of deep learning techniques, a wide range of methods proved the lesions/tissues segmentation can also facilitate the automated disease diagnosis models. However, existing methods are limited in the sense that they can only capture static regional correlations in the images. Inspired by the global and dynamic nature of Vision Transformer, in this paper, we propose Segmentation-Assisted diagnosis Transformer (SeATrans) to transfer the segmentation knowledge to the disease diagnosis network. Specifically, we first propose an asymmetric multi-scale interaction strategy to correlate each single low-level diagnosis feature with multi-scale segmentation features. Then, an effective strategy called SeA-block is adopted to vitalize diagnosis feature via correlated segmentation features. To model the segmentation-diagnosis interaction, SeA-block first embeds the diagnosis feature based on the segmentation information via the encoder, and then transfers the embedding back to the diagnosis feature space by a decoder. Experimental results demonstrate that SeATrans surpasses a wide range of state-of-the-art (SOTA) segmentation-assisted diagnosis methods on several disease diagnosis tasks.

preprint2021arXiv

On Estimating Recommendation Evaluation Metrics under Sampling

Since the recent study (Krichene and Rendle 2020) done by Krichene and Rendle on the sampling-based top-k evaluation metric for recommendation, there has been a lot of debates on the validity of using sampling to evaluate recommendation algorithms. Though their work and the recent work (Li et al.2020) have proposed some basic approaches for mapping the sampling-based metrics to their global counterparts which rank the entire set of items, there is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation. The proposed approaches either are rather uninformative (linking sampling to metric evaluation) or can only work on simple metrics, such as Recall/Precision (Krichene and Rendle 2020; Li et al. 2020). In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics. Since this question is closely related to the underlying mechanism of sampling for recommendation, tackling it can help better understand the power of sampling and can help resolve the questions of if and how should we use sampling for evaluating recommendation. We introduce two approaches based on MLE (MaximalLikelihood Estimation) and its weighted variants, and ME(Maximal Entropy) principals to recover the empirical rank distribution, and then utilize them for metrics estimation. The experimental results show the advantages of using the new approaches for evaluating recommendation algorithms based on top-k metrics.

preprint2020arXiv

A Survey on Causal Inference

Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics, for decades. Nowadays, estimating causal effect from observational data has become an appealing research direction owing to the large amount of available data and low budget requirement, compared with randomized controlled trials. Embraced with the rapidly developed machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well known causal inference framework. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including the applications in advertising, recommendation, medicine and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summarized, which facilitate researchers and practitioners to explore, evaluate and apply the causal inference methods.

preprint2020arXiv

Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data

Product catalogs are valuable resources for eCommerce website. In the catalog, a product is associated with multiple attributes whose values are short texts, such as product name, brand, functionality and flavor. Usually individual retailers self-report these key values, and thus the catalog information unavoidably contains noisy facts. Although existing deep neural network models have shown success in conducting cross-checking between two pieces of texts, their success has to be dependent upon a large set of quality labeled data, which are hard to obtain in this validation task: products span a variety of categories. To address the aforementioned challenges, we propose a novel meta-learning latent variable approach, called MetaBridge, which can learn transferable knowledge from a subset of categories with limited labeled data and capture the uncertainty of never-seen categories with unlabeled data. More specifically, we make the following contributions. (1) We formalize the problem of validating the textual attribute values of products from a variety of categories as a natural language inference task in the few-shot learning setting, and propose a meta-learning latent variable model to jointly process the signals obtained from product profiles and textual attribute values. (2) We propose to integrate meta learning and latent variable in a unified model to effectively capture the uncertainty of various categories. (3) We propose a novel objective function based on latent variable model in the few-shot learning setting, which ensures distribution consistency between unlabeled and labeled data and prevents overfitting by sampling from the learned distribution. Extensive experiments on real eCommerce datasets from hundreds of categories demonstrate the effectiveness of MetaBridge on textual attribute validation and its outstanding performance compared with state-of-the-art approaches.

preprint2020arXiv

Decomposed Adversarial Learned Inference

Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel approach, Decomposed Adversarial Learned Inference (DALI), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results demonstrate that DALI significantly improves both reconstruction and generation as compared to other adversarial inference models.

preprint2020arXiv

Efficient Knowledge Graph Validation via Cross-Graph Representation Learning

Recent advances in information extraction have motivated the automatic construction of huge Knowledge Graphs (KGs) by mining from large-scale text corpus. However, noisy facts are unavoidably introduced into KGs that could be caused by automatic extraction. To validate the correctness of facts (i.e., triplets) inside a KG, one possible approach is to map the triplets into vector representations by capturing the semantic meanings of facts. Although many representation learning approaches have been developed for knowledge graphs, these methods are not effective for validation. They usually assume that facts are correct, and thus may overfit noisy facts and fail to detect such facts. Towards effective KG validation, we propose to leverage an external human-curated KG as auxiliary information source to help detect the errors in a target KG. The external KG is built upon human-curated knowledge repositories and tends to have high precision. On the other hand, although the target KG built by information extraction from texts has low precision, it can cover new or domain-specific facts that are not in any human-curated repositories. To tackle this challenging task, we propose a cross-graph representation learning framework, i.e., CrossVal, which can leverage an external KG to validate the facts in the target KG efficiently. This is achieved by embedding triplets based on their semantic meanings, drawing cross-KG negative samples and estimating a confidence score for each triplet based on its degree of correctness. We evaluate the proposed framework on datasets across different domains. Experimental results show that the proposed framework achieves the best performance compared with the state-of-the-art methods on large-scale KGs.

preprint2020arXiv

Large-scale Real-time Personalized Similar Product Recommendations

Similar product recommendation is one of the most common scenes in e-commerce. Many recommendation algorithms such as item-to-item Collaborative Filtering are working on measuring item similarities. In this paper, we introduce our real-time personalized algorithm to model product similarity and real-time user interests. We also introduce several other baseline algorithms including an image-similarity-based method, item-to-item collaborative filtering, and item2vec, and compare them on our large-scale real-world e-commerce dataset. The algorithms which achieve good offline results are also tested on the online e-commerce website. Our personalized method achieves a 10% improvement on the add-cart number in the real-world e-commerce scenario.

preprint2020arXiv

Practical Data Poisoning Attack against Next-Item Recommendation

Online recommendation systems make use of a variety of information sources to provide users the items that users are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendations approaches. The former often suffers unsatisfactory performance, while the latter requires strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against blackbox recommendation systems. The proposed LOKI utilizes the reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted.Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods.

preprint2020arXiv

Weak Supervision for Fake News Detection via Reinforcement Learning

Today social media has become the primary source for news. Via social media platforms, fake news travel at unprecedented speeds, reach global audiences and put users and communities at great risk. Therefore, it is extremely important to detect fake news as early as possible. Recently, deep learning based approaches have shown improved performance in fake news detection. However, the training of such models requires a large amount of labeled data, but manual annotation is time-consuming and expensive. Moreover, due to the dynamic nature of news, annotated samples may become outdated quickly and cannot represent the news articles on newly emerged events. Therefore, how to obtain fresh and high-quality labeled samples is the major challenge in employing deep learning models for fake news detection. In order to tackle this challenge, we propose a reinforced weakly-supervised fake news detection framework, i.e., WeFEND, which can leverage users&#39; reports as weak supervision to enlarge the amount of training data for fake news detection. The proposed framework consists of three main components: the annotator, the reinforced selector and the fake news detector. The annotator can automatically assign weak labels for unlabeled news based on users&#39; reports. The reinforced selector using reinforcement learning techniques chooses high-quality samples from the weakly labeled data and filters out those low-quality ones that may degrade the detector&#39;s prediction performance. The fake news detector aims to identify fake news based on the news content. We tested the proposed framework on a large collection of news articles published via WeChat official accounts and associated user reports. Extensive experiments on this dataset show that the proposed WeFEND model achieves the best performance compared with the state-of-the-art methods.

preprint2019arXiv

Precision calculations of $B \to V$ form factors in QCD

Applying the vacuum-to-$B$-meson correlation functions with an interpolating current for the light vector meson we construct the light-cone sum rules (LCSR) for the &#34;effective&#34; form factors $ξ_{\parallel}(n \cdot p)$, $ξ_{\perp}(n \cdot p)$, $Ξ_{\parallel}(τ, n \cdot p)$ and $Ξ_{\perp}(τ, n \cdot p)$, defined by the corresponding hadronic matrix elements in soft-collinear effective theory (SCET), entering the leading-power factorization formulae for QCD form factors responsible for $B \to V \ell \bar ν_{\ell}$ and $B \to V \ell \bar \ell$ decays at large hadronic recoil at next-to-leading-order in QCD. The light-quark mass effect for the local SCET form factors $ξ_{\parallel}(n \cdot p)$ and $ξ_{\perp}(n \cdot p)$ is also computed from the LCSR method with the $B$-meson light-cone distribution amplitude $ϕ_B^{+}(ω, μ)$ at ${\cal O}(α_s)$. Furthermore, the subleading power corrections to $B \to V$ form factors from the higher-twist $B$-meson light-cone distribution amplitudes are also computed with the same method at tree level up to the twist-six accuracy. Having at our disposal the LCSR predictions for $B \to V$ form factors, we further perform new determinations of the CKM matrix element $|V_{ub}|$ from the semileptonic $B \to ρ\, \ell \, \bar ν_{\ell}$ and $B \to ω\, \ell \, \bar ν_{\ell}$ decays, and predict the normalized differential branching fractions and the $q^2$-binned $K^{\ast}$ longitudinal polarization fractions of the exclusive rare $B \to K^{\ast} \, ν_{\ell} \, \bar ν_{\ell}$ decays.