Source author record

Jing Gao

Jing Gao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision, hep-ph, Information Retrieval, Databases, Social and Information Networks, Cryptography and Security, cs.CY, eess.IV, hep-ex, hep-lat, Methodology, physics.ins-det, physics.optics, physics.soc-ph

Catalog footprint

What is connected

23 works

17 topics

4 close collaborators


preprint, 2026, arXiv

Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P<0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.

preprint, 2025, arXiv

Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking

Split Federated Learning (SFL) has emerged as an efficient alternative to traditional Federated Learning (FL) by reducing client-side computation through model partitioning. However, exchanging intermediate activations and model updates introduces significant privacy risks, especially from data reconstruction attacks that recover original inputs from intermediate representations. Existing defenses using noise injection often degrade model performance. To overcome these challenges, we present PM-SFL, a scalable and privacy-preserving SFL framework that incorporates Probabilistic Mask training to add structured randomness without relying on explicit noise. This mitigates data reconstruction risks while maintaining model utility. To address data heterogeneity, PM-SFL employs personalized mask learning that tailors submodel structures to each client's local data. For system heterogeneity, we introduce a layer-wise knowledge compensation mechanism, enabling clients with varying resources to participate effectively under adaptive model splitting. Theoretical analysis confirms its privacy protection, and experiments on image and wireless sensing tasks demonstrate that PM-SFL consistently improves accuracy, communication efficiency, and robustness to privacy attacks, with particularly strong performance under data and system heterogeneity.

preprint, 2024, arXiv

Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

In the evolving field of corporate sustainability, analyzing unstructured Environmental, Social, and Governance (ESG) reports is a complex challenge due to their varied formats and intricate content. This study introduces an innovative methodology utilizing the "Unstructured Core Library", specifically tailored to address these challenges by transforming ESG reports into structured, analyzable formats. Our approach significantly advances the existing research by offering high-precision text cleaning, adept identification and extraction of text from images, and standardization of tables within these reports. Emphasizing its capability to handle diverse data types, including text, images, and tables, the method adeptly manages the nuances of differing page layouts and report styles across industries. This research marks a substantial contribution to the fields of industrial ecology and corporate sustainability assessment, paving the way for the application of advanced NLP technologies and large language models in the analysis of corporate governance and sustainability. Our code is available at https://github.com/linancn/TianGong-AI-Unstructure.git.

preprint, 2023, arXiv

A portable sub Hertz ultra-stable laser over 1700km highway transportation

We present a sub-Hz linewidth portable ultrastable laser with a mass of 40 kg and a volume of 400 mm × 280 mm × 450 mm that meets the requirements of automatic frequency locking and road transportation. A dynamic analytical model of the physical parts of the ultrastable laser is established, and the first-order resonance frequency is determined by FEA and agrees well with the experimentally measured result. To verify the transport performance of the portable ultrastable laser, it is tested with 100 km of actual road transportation and 60 min of continuous vibration, corresponding to 1700 km of road transportation. The success of the test demonstrates that the portable ultrastable laser is very robust. Meanwhile, the median of the laser's linewidth distribution is approximately 0.78 Hz, and the fractional frequency instability is less than 3E-15 at 1 to 10 s averaging time. This value approaches the total noise of 2.0E-15, including thermal noise and residual amplitude modulation. This robustness suggests that the portable ultrastable laser might be a good candidate for applications such as optical frequency transfer and metrological systems.

preprint, 2022, arXiv

$B\to D \ell ν_\ell$ form factors beyond leading power and extraction of $|V_{cb}|$ and $R(D)$

We investigate the subleading-power corrections to the exclusive $B\to D \ell ν_\ell$ form factors at ${\cal O} (α_s^0)$ in the light-cone sum rules (LCSR) framework by including the two- and three-particle higher-twist contributions from the $B$-meson light-cone distribution amplitudes up to the twist-six accuracy, by taking into account the subleading terms in expanding the hard-collinear charm-quark propagator, and by evaluating the hadronic matrix element of the subleading effective current $\bar q \, γ_μ \, i \not\!\!{D}_\perp / (2 \, m_b) \, h_v$. Employing further the available leading-power results for the semileptonic $B \to D$ form factors at the next-to-leading-logarithmic level and combining our improved LCSR predictions with the recent lattice determinations, we then carry out a comprehensive phenomenological analysis on the semi-leptonic $B\to D \ell ν_\ell$ decay. We extract $|V_{cb}| = \big( 40.2^{+0.6}_{-0.5} {\big |_{\rm th}}\,\, {}^{+1.4}_{-1.4} {\big |_{\rm exp}} \big)\times 10^{-3}$ ($|V_{cb}| = \big( 40.9^{+0.6}_{-0.5} {\big |_{\rm th}}\,\, {}^{+1.0}_{-1.0} {\big |_{\rm exp}} \big)\times 10^{-3}$) using the BaBar (Belle) experimental data, and particularly obtain for the gold-plated ratio $R(D) = 0.302\pm 0.003$.

preprint, 2022, arXiv

Label a Herd in Minutes: Individual Holstein-Friesian Cattle Identification

We describe a practically evaluated approach for training visual cattle ID systems for a whole farm requiring only ten minutes of labelling effort. In particular, for the task of automatic identification of individual Holstein-Friesians in real-world farm CCTV, we show that self-supervision, metric learning, cluster analysis, and active learning can complement each other to significantly reduce the annotation requirements usually needed to train cattle identification frameworks. Evaluating the approach on the test portion of the publicly available Cows2021 dataset, for training we use 23,350 frames across 435 single individual tracklets generated by automated oriented cattle detection and tracking in operational farm footage. Self-supervised metric learning is first employed to initialise a candidate identity space where each tracklet is considered a distinct entity. Grouping entities into equivalence classes representing cattle identities is then performed by automated merging via cluster analysis and active learning. Critically, we identify the inflection point at which automated choices cannot replicate improvements based on human intervention to reduce annotation to a minimum. Experimental results show that cluster analysis and a few minutes of labelling after automated self-supervision can improve the test identification accuracy of 153 identities to 92.44% (ARI=0.93) from the 74.9% (ARI=0.754) obtained by self-supervision only. These promising results indicate that a tailored combination of human and machine reasoning in visual cattle ID pipelines can be highly effective whilst requiring only minimal labelling effort. We provide all key source code and network weights with this paper for easy result reproduction.
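The abstract above reports clustering quality with the Adjusted Rand Index (ARI). As a rough illustration of what that number measures, the sketch below computes ARI between two flat labelings of the same items using the standard pair-counting formula. The function name and the toy labels are hypothetical and are not taken from the paper's released code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        # No item pairs to compare; treat as trivial agreement (assumption).
        return 1.0

    # Contingency counts: items per (true cluster, predicted cluster) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. one cluster on both sides (assumption: score 1).
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Same grouping under different label names scores 1.0.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that splits every true cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI ignores the label names themselves, which is why it suits a setting where tracklets are merged into identity clusters whose numbering is arbitrary.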

preprint, 2022, arXiv

LiST: Lite Prompted Self-training Makes Parameter-Efficient Few-shot Learners

We present a new method, LiST (short for Lite Prompted Self-Training), for parameter-efficient fine-tuning of large pre-trained language models (PLMs) for few-shot learning. LiST improves over recent methods that adopt prompt-based fine-tuning (FN) using two key techniques. The first is the use of self-training to leverage large amounts of unlabeled data for prompt-based FN in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. Self-training is expensive as it requires updating all the model parameters repetitively. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. Our experiments show that LiST can effectively leverage unlabeled data to improve the model performance for few-shot learning. Additionally, the fine-tuning is efficient as it only updates a small percentage of parameters, and the overall model footprint is reduced since several tasks can share a common PLM encoder as backbone. A comprehensive study on six NLU tasks demonstrates that LiST improves by 35% over classic fine-tuning and 6% over prompt-based FN, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each task. With only 14M tunable parameters, LiST outperforms GPT-3 in-context learning by 33% on few-shot NLU tasks.

preprint, 2022, arXiv

Next-to-Next-to-Leading-Order QCD Prediction for the Photon-Pion Form Factor

We accomplish the complete two-loop computation of the leading-twist contribution to the photon-pion transition form factor $γ\, γ^{\ast} \to π^0$ by applying the hard-collinear factorization theorem together with modern multi-loop techniques. The resulting predictions for the form factor indicate that the two-loop perturbative correction is numerically important. We also demonstrate that our results will play a key role in disentangling various models of the twist-two pion distribution amplitude thanks to the envisaged precision at Belle II.

preprint, 2022, arXiv

SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer

Clinically, the accurate annotation of lesions/tissues can significantly facilitate the disease diagnosis. For example, the segmentation of optic disc/cup (OD/OC) on fundus image would facilitate the glaucoma diagnosis, the segmentation of skin lesions on dermoscopic images is helpful to the melanoma diagnosis, etc. With the advancement of deep learning techniques, a wide range of methods proved the lesions/tissues segmentation can also facilitate the automated disease diagnosis models. However, existing methods are limited in the sense that they can only capture static regional correlations in the images. Inspired by the global and dynamic nature of Vision Transformer, in this paper, we propose Segmentation-Assisted diagnosis Transformer (SeATrans) to transfer the segmentation knowledge to the disease diagnosis network. Specifically, we first propose an asymmetric multi-scale interaction strategy to correlate each single low-level diagnosis feature with multi-scale segmentation features. Then, an effective strategy called SeA-block is adopted to vitalize diagnosis feature via correlated segmentation features. To model the segmentation-diagnosis interaction, SeA-block first embeds the diagnosis feature based on the segmentation information via the encoder, and then transfers the embedding back to the diagnosis feature space by a decoder. Experimental results demonstrate that SeATrans surpasses a wide range of state-of-the-art (SOTA) segmentation-assisted diagnosis methods on several disease diagnosis tasks.

preprint, 2021, arXiv

On Estimating Recommendation Evaluation Metrics under Sampling

Since the recent study by Krichene and Rendle (2020) on sampling-based top-k evaluation metrics for recommendation, there has been much debate on the validity of using sampling to evaluate recommendation algorithms. Though their work and the recent work of Li et al. (2020) have proposed some basic approaches for mapping the sampling-based metrics to their global counterparts, which rank the entire set of items, there is still a lack of understanding and consensus on how sampling should be used for recommendation evaluation. The proposed approaches either are rather uninformative (linking sampling to metric evaluation) or can only work on simple metrics, such as Recall/Precision (Krichene and Rendle 2020; Li et al. 2020). In this paper, we introduce a new research problem on learning the empirical rank distribution, and a new approach based on the estimated rank distribution, to estimate the top-k metrics. Since this question is closely related to the underlying mechanism of sampling for recommendation, tackling it can help better understand the power of sampling and can help resolve the questions of whether and how we should use sampling for evaluating recommendation. We introduce two approaches based on MLE (Maximal Likelihood Estimation) and its weighted variants, and ME (Maximal Entropy) principles to recover the empirical rank distribution, and then utilize them for metrics estimation. The experimental results show the advantages of using the new approaches for evaluating recommendation algorithms based on top-k metrics.
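To make the link between a sampled rank and a global rank concrete, here is a minimal sketch under one stated assumption: negatives are drawn uniformly and with replacement from the other items. In that case the number of sampled negatives that outrank the target is binomial, so the chance of a top-k hit under sampling follows directly from the target's global rank. The function and parameter names are hypothetical, and this is an illustration of the sampling mechanism, not the estimators proposed in the paper.

```python
from math import comb


def sampled_hit_probability(global_rank, num_items, num_samples, k):
    """P(target is in the sampled top-k) given its rank among all items.

    Assumes negatives are sampled uniformly with replacement from the
    other num_items - 1 items (an assumption of this sketch).
    """
    # Chance that a single sampled negative outranks the target.
    p_above = (global_rank - 1) / (num_items - 1)
    # Sampled rank = 1 + Binomial(num_samples, p_above); the target is a hit
    # when fewer than k sampled negatives outrank it.
    return sum(
        comb(num_samples, j) * p_above**j * (1 - p_above) ** (num_samples - j)
        for j in range(min(k, num_samples + 1))
    )


# With 100 sampled negatives out of 10,000 items, an item ranked 100th
# globally is far more likely to be a top-10 hit than one ranked 1000th.
p_rank_100 = sampled_hit_probability(100, 10_000, 100, 10)
p_rank_1000 = sampled_hit_probability(1000, 10_000, 100, 10)
```

Because many different global ranks map to a high sampled hit probability, a sampled Recall@k cannot be read as the global Recall@k without some correction, which is the gap the abstract addresses.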

preprint, 2020, arXiv

A Survey on Causal Inference

Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and low budget requirement, compared with randomized controlled trials. Alongside the rapidly developing machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summarized, which helps researchers and practitioners explore, evaluate and apply the causal inference methods.

preprint, 2020, arXiv

Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data

Product catalogs are valuable resources for eCommerce websites. In the catalog, a product is associated with multiple attributes whose values are short texts, such as product name, brand, functionality and flavor. Usually individual retailers self-report these key values, and thus the catalog information unavoidably contains noisy facts. Although existing deep neural network models have shown success in conducting cross-checking between two pieces of text, their success depends upon a large set of quality labeled data, which are hard to obtain in this validation task: products span a variety of categories. To address the aforementioned challenges, we propose a novel meta-learning latent variable approach, called MetaBridge, which can learn transferable knowledge from a subset of categories with limited labeled data and capture the uncertainty of never-seen categories with unlabeled data. More specifically, we make the following contributions. (1) We formalize the problem of validating the textual attribute values of products from a variety of categories as a natural language inference task in the few-shot learning setting, and propose a meta-learning latent variable model to jointly process the signals obtained from product profiles and textual attribute values. (2) We propose to integrate meta learning and latent variable in a unified model to effectively capture the uncertainty of various categories. (3) We propose a novel objective function based on latent variable model in the few-shot learning setting, which ensures distribution consistency between unlabeled and labeled data and prevents overfitting by sampling from the learned distribution. Extensive experiments on real eCommerce datasets from hundreds of categories demonstrate the effectiveness of MetaBridge on textual attribute validation and its outstanding performance compared with state-of-the-art approaches.

preprint, 2020, arXiv

Decomposed Adversarial Learned Inference

Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel approach, Decomposed Adversarial Learned Inference (DALI), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results demonstrate that DALI significantly improves both reconstruction and generation as compared to other adversarial inference models.

preprint, 2020, arXiv

Efficient Knowledge Graph Validation via Cross-Graph Representation Learning

Recent advances in information extraction have motivated the automatic construction of huge Knowledge Graphs (KGs) by mining large-scale text corpora. However, automatic extraction unavoidably introduces noisy facts into KGs. To validate the correctness of facts (i.e., triplets) inside a KG, one possible approach is to map the triplets into vector representations by capturing the semantic meanings of facts. Although many representation learning approaches have been developed for knowledge graphs, these methods are not effective for validation. They usually assume that facts are correct, and thus may overfit noisy facts and fail to detect such facts. Towards effective KG validation, we propose to leverage an external human-curated KG as an auxiliary information source to help detect the errors in a target KG. The external KG is built upon human-curated knowledge repositories and tends to have high precision. On the other hand, although the target KG built by information extraction from texts has low precision, it can cover new or domain-specific facts that are not in any human-curated repositories. To tackle this challenging task, we propose a cross-graph representation learning framework, i.e., CrossVal, which can leverage an external KG to validate the facts in the target KG efficiently. This is achieved by embedding triplets based on their semantic meanings, drawing cross-KG negative samples and estimating a confidence score for each triplet based on its degree of correctness. We evaluate the proposed framework on datasets across different domains. Experimental results show that the proposed framework achieves the best performance compared with the state-of-the-art methods on large-scale KGs.

preprint, 2020, arXiv

Large-scale Real-time Personalized Similar Product Recommendations

Similar product recommendation is one of the most common scenes in e-commerce. Many recommendation algorithms such as item-to-item Collaborative Filtering are working on measuring item similarities. In this paper, we introduce our real-time personalized algorithm to model product similarity and real-time user interests. We also introduce several other baseline algorithms including an image-similarity-based method, item-to-item collaborative filtering, and item2vec, and compare them on our large-scale real-world e-commerce dataset. The algorithms which achieve good offline results are also tested on the online e-commerce website. Our personalized method achieves a 10% improvement on the add-cart number in the real-world e-commerce scenario.
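The abstract names item-to-item collaborative filtering as one of its baselines. A minimal sketch of that general idea, assuming binary user-item interactions and cosine similarity between the sets of users who touched each item, might look as follows. The function name and the toy baskets are hypothetical, and this is not the personalized algorithm the paper introduces.

```python
from collections import defaultdict
from math import sqrt


def item_cosine_similarities(baskets):
    """Cosine similarity between items from binary user-item interactions.

    baskets maps a user id to the set of item ids that user interacted with.
    Returns {(item_a, item_b): similarity} for item_a < item_b, omitting
    pairs with no shared users.
    """
    # Invert the data: for each item, which users interacted with it.
    item_users = defaultdict(set)
    for user, items in baskets.items():
        for item in items:
            item_users[item].add(user)

    sims = {}
    ordered = sorted(item_users)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            shared = len(item_users[a] & item_users[b])
            if shared:
                # Cosine of two binary vectors: |A ∩ B| / sqrt(|A| * |B|).
                sims[(a, b)] = shared / sqrt(len(item_users[a]) * len(item_users[b]))
    return sims


baskets = {
    "u1": {"tea", "mug"},
    "u2": {"tea", "mug"},
    "u3": {"tea", "kettle"},
}
similarities = item_cosine_similarities(baskets)
```

This all-pairs loop is quadratic in the number of items, so it only illustrates the similarity measure; a production system at the scale the abstract describes would need an indexed or approximate approach.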

preprint, 2020, arXiv

Practical Data Poisoning Attack against Next-Item Recommendation

Online recommendation systems make use of a variety of information sources to provide users with the items they are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendation approaches. The former often suffer from unsatisfactory performance, while the latter require strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against blackbox recommendation systems. The proposed LOKI utilizes a reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted. Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods.

preprint, 2020, arXiv

Weak Supervision for Fake News Detection via Reinforcement Learning

Today social media has become the primary source for news. Via social media platforms, fake news travels at unprecedented speeds, reaches global audiences and puts users and communities at great risk. Therefore, it is extremely important to detect fake news as early as possible. Recently, deep learning based approaches have shown improved performance in fake news detection. However, the training of such models requires a large amount of labeled data, and manual annotation is time-consuming and expensive. Moreover, due to the dynamic nature of news, annotated samples may become outdated quickly and cannot represent the news articles on newly emerged events. Therefore, how to obtain fresh and high-quality labeled samples is the major challenge in employing deep learning models for fake news detection. In order to tackle this challenge, we propose a reinforced weakly-supervised fake news detection framework, i.e., WeFEND, which can leverage users' reports as weak supervision to enlarge the amount of training data for fake news detection. The proposed framework consists of three main components: the annotator, the reinforced selector and the fake news detector. The annotator can automatically assign weak labels for unlabeled news based on users' reports. The reinforced selector, using reinforcement learning techniques, chooses high-quality samples from the weakly labeled data and filters out those low-quality ones that may degrade the detector's prediction performance. The fake news detector aims to identify fake news based on the news content. We tested the proposed framework on a large collection of news articles published via WeChat official accounts and associated user reports. Extensive experiments on this dataset show that the proposed WeFEND model achieves the best performance compared with the state-of-the-art methods.

preprint, 2019, arXiv

Precision calculations of $B \to V$ form factors in QCD

Applying the vacuum-to-$B$-meson correlation functions with an interpolating current for the light vector meson we construct the light-cone sum rules (LCSR) for the "effective" form factors $ξ_{\parallel}(n \cdot p)$, $ξ_{\perp}(n \cdot p)$, $Ξ_{\parallel}(τ, n \cdot p)$ and $Ξ_{\perp}(τ, n \cdot p)$, defined by the corresponding hadronic matrix elements in soft-collinear effective theory (SCET), entering the leading-power factorization formulae for QCD form factors responsible for $B \to V \ell \bar ν_{\ell}$ and $B \to V \ell \bar \ell$ decays at large hadronic recoil at next-to-leading-order in QCD. The light-quark mass effect for the local SCET form factors $ξ_{\parallel}(n \cdot p)$ and $ξ_{\perp}(n \cdot p)$ is also computed from the LCSR method with the $B$-meson light-cone distribution amplitude $ϕ_B^{+}(ω, μ)$ at ${\cal O}(α_s)$. Furthermore, the subleading power corrections to $B \to V$ form factors from the higher-twist $B$-meson light-cone distribution amplitudes are also computed with the same method at tree level up to the twist-six accuracy. Having at our disposal the LCSR predictions for $B \to V$ form factors, we further perform new determinations of the CKM matrix element $|V_{ub}|$ from the semileptonic $B \to ρ\, \ell \, \bar ν_{\ell}$ and $B \to ω\, \ell \, \bar ν_{\ell}$ decays, and predict the normalized differential branching fractions and the $q^2$-binned $K^{\ast}$ longitudinal polarization fractions of the exclusive rare $B \to K^{\ast} \, ν_{\ell} \, \bar ν_{\ell}$ decays.

preprint, 2016, arXiv

Multi-source Hierarchical Prediction Consolidation

In big data applications such as healthcare data mining, due to privacy concerns, it is necessary to collect predictions from multiple information sources for the same instance, with raw features being discarded or withheld when aggregating multiple predictions. Besides, crowd-sourced labels need to be aggregated to estimate the ground truth of the data. Because of imperfect predictive models or human crowdsourcing workers, noisy and conflicting information is ubiquitous and inevitable. Although state-of-the-art aggregation methods have been proposed to handle label spaces with flat structures, as the label space is becoming more and more complicated, aggregation under a hierarchical label structure becomes necessary but has been largely ignored. These label hierarchies can be quite informative as they are usually created by domain experts to make sense of highly complex label correlations for many real-world cases like protein functionality interactions or disease relationships. We propose a novel multi-source hierarchical prediction consolidation method that effectively exploits the complicated hierarchical label structures to resolve the noisy and conflicting information that inherently originates from multiple imperfect sources. We formulate the problem as an optimization problem with a closed-form solution. The proposed method captures the smoothness over all information sources while penalizing any consolidation result that violates the constraints derived from the label hierarchy. The hierarchical instance similarity, as well as the consolidation result, is inferred in a totally unsupervised, iterative fashion. Experimental results on both synthetic and real-world datasets show the effectiveness of the proposed method over existing alternatives.

preprint, 2016, arXiv

Overcoming the language barrier in mobile user interface design: A case study on a mobile health app

This research report proposes a structured solution to address the need for awareness of culture and language in user interface design. It includes an evaluation of research on established methods that already exist. The ideas discussed include: what others have found should be taken into consideration when using design principles to develop an interface, the difficulties and critical issues that have previously been identified, and the ways that have already been found to overcome such issues. The work also involves designing a prototype application that caters to resolving these issues. Overcoming the language barrier plays an important role in implementing a user interface design that will satisfy users. This issue must be researched and examined to identify the associated issues and concerns in order to provide a solution in an ethical manner.

preprint, 2015, arXiv

A Survey on Truth Discovery

Thanks to information explosion, data for the objects of interest can be collected from increasingly more sources. However, for the same object, there usually exist conflicts among the collected multi-source information. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we focus on providing a comprehensive overview of truth discovery methods, and summarizing them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of the current progress on truth discovery, and offer some guidelines on how to apply these approaches in application domains.
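The core loop described above, estimating truths and source reliabilities together, can be sketched as a simple iterative weighted vote over categorical claims. This is a generic illustration under stated assumptions (categorical values, a smoothed error rate turned into a log weight), not any specific method covered by the survey. The function name, the smoothing constants and the toy claims are all hypothetical.

```python
from collections import defaultdict
from math import log


def truth_discovery(claims, iterations=10):
    """Jointly estimate truths and source weights by iterative weighted voting.

    claims maps a source id to a dict of {object id: claimed value}.
    Returns (truths, weights).
    """
    weights = {source: 1.0 for source in claims}
    truths = {}
    for _ in range(iterations):
        # Step 1: for each object, pick the value with the largest total weight.
        votes = defaultdict(lambda: defaultdict(float))
        for source, observed in claims.items():
            for obj, value in observed.items():
                votes[obj][value] += weights[source]
        truths = {obj: max(tally, key=tally.get) for obj, tally in votes.items()}

        # Step 2: sources that disagree with the current truths lose weight.
        for source, observed in claims.items():
            errors = sum(1 for obj, value in observed.items() if truths[obj] != value)
            # Laplace-smoothed error rate keeps the weight finite and positive.
            error_rate = (errors + 1) / (len(observed) + 2)
            weights[source] = -log(error_rate)
    return truths, weights


claims = {
    "A": {"x": 1, "y": 2, "z": 3},
    "B": {"x": 1, "y": 2, "z": 4},
    "C": {"x": 5, "y": 2, "z": 3},
}
truths, weights = truth_discovery(claims)
```

In this toy input, source A agrees with the consensus on every object and so ends with the largest weight, while B and C each make one error and are down-weighted equally.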

preprint, 2015, arXiv

Urban spatial-temporal activity structures: a New Approach to Inferring the Intra-urban Functional Regions via Social Media Check-In Data

Most existing literature focuses on the exterior temporal rhythm of human movement to infer the functional regions in a city, but neglects the underlying interdependence between the functional regions and human activities, which uncovers more detailed characteristics of regions. In this research, we propose a novel model based on low rank approximation (LRA) to detect the functional regions using data from about 15 million check-in records during a yearlong period in Shanghai, China. We find a series of latent structures, called urban spatial-temporal activity structures (USTAS). While interpreting these structures, a series of outstanding underlying associations between the spatial and temporal activity patterns can be found. Moreover, we can not only reproduce the observed data with a lower-dimensional representation but also simultaneously project both the spatial and temporal activity patterns in the same coordinate system. By utilizing the K-means clustering algorithm, five significant types of clusters, which are directly annotated with a corresponding combination of temporal activities, can be obtained. This provides a clear picture of how the groups of regions are associated with different activities at different times of day. Besides the commercial and transportation dominant areas, we also detect two kinds of residential areas, the developed residential areas and the developing residential areas. We further verify the spatial distribution of these clusters from the viewpoint of urban form analysis. The results show a high consistency with the government planning from the same period, indicating that our model is applicable for inferring the functional regions via social media check-in data, and can benefit a wide range of fields, such as urban planning, public services and location-based recommender systems.

preprint2013arXiv

Multilabel Consensus Classification

In the era of big data, a large amount of noisy and incomplete data can be collected from multiple sources for prediction tasks. Combining multiple models or data sources helps to counteract the effects of low data quality and the bias of any single model or data source, and thus can improve the robustness and the performance of predictive models. Owing to privacy, storage and bandwidth considerations, in certain circumstances one has to combine the predictions from multiple models or data sources to obtain the final predictions without accessing the raw data. Consensus-based prediction combination algorithms are effective in such situations. However, current research on prediction combination focuses on the single-label setting, where an instance can have one and only one label. Data nowadays are usually multilabeled, however, so that more than one label has to be predicted at the same time. Directly applying existing prediction combination methods to multilabel settings can lead to degraded performance. In this paper, we address the challenges of combining predictions from multiple multilabel classifiers and propose two novel algorithms, MLCM-r (MultiLabel Consensus Maximization for ranking) and MLCM-a (MLCM for microAUC). These algorithms can capture label correlations that are common in multilabel classification and optimize the corresponding performance metrics. Experimental results on popular multilabel classification tasks verify the theoretical analysis and the effectiveness of the proposed methods.
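
To make the prediction-combination setting concrete, the sketch below merges per-label scores from several multilabel classifiers without touching the raw data. It is a naive label-by-label averaging baseline, not MLCM-r or MLCM-a, whose details the abstract does not give; it treats every label independently and therefore ignores the label correlations the proposed algorithms are said to capture. The score values and the names average_scores and top_k_labels are hypothetical.

```python
import numpy as np

def average_scores(score_matrices):
    """Average (instances x labels) score matrices from several models."""
    stacked = np.stack(score_matrices)  # models x instances x labels
    return stacked.mean(axis=0)

def top_k_labels(scores, k):
    """Indices of the k highest-scoring labels per instance, best first."""
    return np.argsort(-scores, axis=1)[:, :k]

# Hypothetical scores from three models for 2 instances and 3 labels.
model_a = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])
model_b = np.array([[0.7, 0.4, 0.8], [0.2, 0.6, 0.7]])
model_c = np.array([[0.8, 0.1, 0.5], [0.3, 0.9, 0.2]])

combined = average_scores([model_a, model_b, model_c])
ranked = top_k_labels(combined, k=2)
```

Only the models' outputs are needed here, which matches the privacy, storage and bandwidth constraint the abstract describes.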

23 published item(s)

Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking

Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

A portable sub Hertz ultra-stable laser over 1700km highway transportation

$B\to D \ell ν_\ell$ form factors beyond leading power and extraction of $|V_{cb}|$ and $R(D)$

Label a Herd in Minutes: Individual Holstein-Friesian Cattle Identification

LiST: Lite Prompted Self-training Makes Parameter-Efficient Few-shot Learners

Next-to-Next-to-Leading-Order QCD Prediction for the Photon-Pion Form Factor

SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer

On Estimating Recommendation Evaluation Metrics under Sampling

A Survey on Causal Inference

Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data

Decomposed Adversarial Learned Inference

Efficient Knowledge Graph Validation via Cross-Graph Representation Learning

Large-scale Real-time Personalized Similar Product Recommendations

Practical Data Poisoning Attack against Next-Item Recommendation

Weak Supervision for Fake News Detection via Reinforcement Learning

Precision calculations of $B \to V$ form factors in QCD

Multi-source Hierarchical Prediction Consolidation

Overcoming the language barrier in mobile user interface design: A case study on a mobile health app

A Survey on Truth Discovery

Urban spatial-temporal activity structures: a New Approach to Inferring the Intra-urban Functional Regions via Social Media Check-In Data

Multilabel Consensus Classification