Researcher profile

Tao Shen

Tao Shen contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
14 works
0 followers
10 topics
4 close collaborators

Published work

14 published items

preprint · 2026 · arXiv

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including Kling O1, on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenarios.

preprint · 2026 · arXiv

MIPO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning

Representation learning on electronic health records (EHRs) plays a vital role in downstream medical prediction tasks. Although natural language processing techniques, such as recurrent neural networks and self-attention, have been adapted for learning medical representations from hierarchical, time-stamped EHR data, they often struggle when either general or task-specific data are limited. Recent efforts have attempted to mitigate this challenge by incorporating medical ontologies (i.e., knowledge graphs) into self-supervised tasks like diagnosis prediction. However, two main issues remain: (1) small and uniform ontologies that lack diversity for robust learning, and (2) insufficient attention to the critical contexts or dependencies underlying patient journeys, which could further enhance ontology-based learning. To address these gaps, we propose MIPO (Mutual Integration of Patient Journey and Medical Ontology), a robust end-to-end framework that employs a Transformer-based architecture for representation learning. MIPO emphasizes task-specific representation learning through a sequential diagnosis prediction task, while also incorporating an ontology-based disease-typing task. A graph-embedding module is introduced to integrate information from patient visit records, thus alleviating data insufficiency. This setup creates a mutually reinforcing loop, where both patient-journey embedding and ontology embedding benefit from each other. We validate MIPO on two real-world benchmark datasets, showing that it consistently outperforms baseline methods under both sufficient and limited data conditions. Furthermore, the resulting diagnosis embeddings offer improved interpretability, underscoring the promise of MIPO for real-world healthcare applications.

preprint · 2026 · arXiv

Multispectral UV Imaging on Capacitive CMOS Arrays Enabled by Solution-Processed Metal-Oxide Nanoparticles

Ultraviolet (UV) imagers are important for a variety of applications, such as quality inspection in the semiconductor industry, forensics and food quality inspection, but are often costly because they require dedicated semiconductor process flows. Here, an imaging chip is introduced that has been fabricated using standard 40 nm complementary metal-oxide-semiconductor (CMOS) technology. Instead of using a conventional charge-based photodetection principle, the imager uses a capacitive operation principle where UV light causes capacitance changes via the photodielectric effect in a functionalization layer, which are measured by the underlying CMOS circuitry. This spin-coated or inkjet-printed functionalization layer consists of solution-processed, wide-bandgap, semiconducting metal-oxide nanoparticles, and facilitates multispectral imaging. The sensors exhibit low noise-equivalent powers (17-138 fW Hz$^{-1/2}$) across the UV bands. Unlike conventional silicon CMOS imagers, the present capacitive-CMOS platform is inherently visible-blind, providing selective UV detection. This work positions late-functionalized capacitive-CMOS arrays as a route toward reducing the cost of UV imagers, which can lead to their more widespread implementation in consumer and low-volume application-specific products.

preprint · 2026 · arXiv

Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning

Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: (a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and (b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.

preprint · 2022 · arXiv

ClarET: Pre-training a Correlation-Aware Context-To-Event Transformer for Event-Centric Generation and Classification

Generating new events given context with correlated ones plays a crucial role in many event-centric reasoning tasks. Existing works either limit their scope to specific scenarios or overlook event-level correlations. In this paper, we propose to pre-train a general Correlation-aware context-to-Event Transformer (ClarET) for event-centric reasoning. To achieve this, we propose three novel event-centric objectives, i.e., whole event recovering, contrastive event-correlation encoding and prompt-based event locating, which highlight event-level correlations with effective training. The proposed ClarET is applicable to a wide range of event-centric reasoning scenarios, considering its versatility of (i) event-correlation types (e.g., causal, temporal, contrast), (ii) application formulations (i.e., generation and classification), and (iii) reasoning types (e.g., abductive, counterfactual and ending reasoning). Empirical fine-tuning results, as well as zero- and few-shot learning, on 9 benchmarks (5 generation and 4 classification tasks covering 4 reasoning types with diverse event correlations), verify its effectiveness and generalization ability.

preprint · 2022 · arXiv

Edge-Cloud Polarization and Collaboration: A Comprehensive Survey for AI

Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both of the computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models owing to model innovations (e.g., Transformers, Pretrained families), the explosion of training data and soaring computing capabilities. However, edge computing, especially edge and cloud collaborative computing, is still in its infancy and has yet to announce its success, owing to resource-constrained IoT scenarios with very limited algorithms deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up the collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experiences with, some ongoing advanced edge AI topics including pretraining models, graph neural networks and reinforcement learning. Finally, we discuss the promising directions and challenges in this field.

preprint · 2022 · arXiv

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

Non-coding RNA structure and function are essential to understanding various biological processes, such as cell signaling, gene expression, and post-transcriptional regulation. These are all among the core problems in the RNA field. With the rapid growth of sequencing technology, we have accumulated a massive amount of unannotated RNA sequences. On the other hand, costly experimental observation yields only a limited amount of annotated data and 3D structures. Hence, it is still challenging to design computational methods for predicting their structures and functions. The lack of annotated data and systematic study causes inferior performance. To resolve the issue, we propose a novel RNA foundation model (RNA-FM) to take advantage of all the 23 million non-coding RNA sequences through self-supervised learning. Within this approach, we discover that the pre-trained RNA-FM could infer sequential and evolutionary information of non-coding RNAs without using any labels. Furthermore, we demonstrate RNA-FM's effectiveness by applying it to the downstream secondary/3D structure prediction, SARS-CoV-2 genome structure and evolution prediction, protein-RNA binding preference modeling, and gene expression regulation modeling. The comprehensive experiments show that the proposed method improves the RNA structural and functional modelling results significantly and consistently. Despite only being trained with unlabelled data, RNA-FM can serve as the foundational model for the field.

preprint · 2022 · arXiv

Towards Robust Ranker for Text Retrieval

A ranker plays an indispensable role in the de facto 'retrieval & rerank' pipeline, but its training still lags behind -- learning from moderate negatives and/or serving as an auxiliary module for a retriever. In this work, we first identify two major barriers to a robust ranker, i.e., inherent label noise caused by a well-trained retriever and non-ideal negatives sampled for a highly capable ranker. We therefore propose using multiple retrievers as negative generators to improve the ranker's robustness, where i) involving extensive out-of-distribution label noise renders the ranker robust against each noise distribution, and ii) diverse hard negatives from a joint distribution are relatively close to the ranker's negative distribution, leading to more challenging and thus more effective training. To evaluate our robust ranker (dubbed R$^2$anker), we conduct experiments in various settings on the popular passage retrieval benchmark, including BM25-reranking, full-ranking, retriever distillation, etc. The empirical results verify the new state-of-the-art effectiveness of our model.

preprint · 2021 · arXiv

Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion

Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks, but these graphs are usually incomplete, which calls for their automatic completion. Prevalent graph embedding approaches, e.g., TransE, learn structured knowledge by representing graph elements as dense embeddings and capturing their triple-level relationship with spatial distance. However, they hardly generalize to elements never visited in training and are intrinsically vulnerable to graph incompleteness. In contrast, textual encoding approaches, e.g., KG-BERT, resort to a graph triple's text and triple-level contextualized representations. They generalize well and are robust to incompleteness, especially when coupled with pre-trained encoders. But two major drawbacks limit the performance: (1) high overheads due to the costly scoring of all possible triples in inference, and (2) a lack of structured knowledge in the textual encoder. In this paper, we follow the textual encoding paradigm and aim to alleviate its drawbacks by augmenting it with graph embedding techniques -- a complementary hybrid of both paradigms. Specifically, we partition each triple into two asymmetric parts as in translation-based graph embedding approaches, and encode both parts into contextualized representations by a Siamese-style textual encoder. Built upon the representations, our model employs both a deterministic classifier and a spatial measurement for representation and structure learning, respectively. Moreover, we develop a self-adaptive ensemble scheme to further improve the performance by incorporating triple scores from an existing graph embedding model. In experiments, we achieve state-of-the-art performance on three benchmarks and a zero-shot dataset for link prediction, with highlights of inference costs reduced by 1-2 orders of magnitude compared to a textual encoding method.

preprint · 2020 · arXiv

Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning

In this work, we aim at equipping pre-trained language models with structured knowledge. We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme that exploits relational knowledge underlying the text. This is fulfilled by using a linked knowledge graph to select informative entities and then masking their mentions. In addition, we use knowledge graphs to obtain distractors for the masked entities, and propose a novel distractor-suppressed ranking objective which is optimized jointly with the masked language model. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training, to inject language models with structured knowledge via learning from raw text. It is more efficient than retrieval-based methods that perform entity linking and integration during finetuning and inference, and generalizes more effectively than the methods that directly learn from concatenated graph triples. Experiments show that our proposed model achieves improved performance on five benchmark datasets, including question answering and knowledge base completion tasks.

preprint · 2020 · arXiv

Federated Mutual Learning

Federated learning (FL) enables collaboratively training deep learning models on decentralized data. However, three types of heterogeneity in the FL setting bring distinctive challenges to the canonical federated learning algorithm (FedAvg). First, due to the Non-IIDness of data, the global shared model may perform worse than local models that are trained solely on their private data. Second, the objectives of the central server and the clients may differ: the central server seeks a generalized model whereas clients pursue personalized models, and clients may run different tasks. Third, clients may need to design their own customized models for various scenes and tasks. In this work, we present a novel federated learning paradigm, named Federated Mutual Learning (FML), that deals with the three heterogeneities. FML allows clients to train a generalized model collaboratively and a personalized model independently, and to design their private customized models. Thus, the Non-IIDness of data is no longer a bug but a feature that allows clients to be better served personally. The experiments show that FML can achieve better performance than alternatives in the typical FL setting, and that clients can benefit from FML with different models and tasks.

preprint · 2020 · arXiv

Self-Attention Enhanced Patient Journey Understanding in Healthcare System

Understanding patients' journeys in the healthcare system is a fundamental prerequisite task for a broad range of AI-based healthcare applications. This task aims to learn an informative representation that can comprehensively encode hidden dependencies among medical events and their inner entities, so that the encoded outputs can greatly benefit downstream application-driven tasks. A patient journey is a sequence of electronic health records (EHRs) over time that is organized at multiple levels: patient, visits and medical codes. The key challenge of patient journey understanding is to design an effective encoding mechanism which can properly tackle the aforementioned multi-level structured patient journey data with temporal sequential visits and a set of medical codes. This paper proposes a novel self-attention mechanism that can simultaneously capture the contextual and temporal relationships hidden in patient journeys. A multi-level self-attention network (MusaNet) is specifically designed to learn the representations of patient journeys, each of which is a long sequence of activities. MusaNet is trained in an end-to-end manner using training data derived from EHRs. We evaluated the efficacy of our method on two medical application tasks with real-world benchmark datasets. The results demonstrate that the proposed MusaNet produces higher-quality representations than state-of-the-art baseline methods. The source code is available at https://github.com/xueping/MusaNet.

preprint · 2019 · arXiv

A realistic dimension-independent approach for charged defect calculations in semiconductors

First-principles calculations of charged defects have become a cornerstone of research in semiconductors and insulators by providing insights into their fundamental physical properties. But the current standard approach using the so-called jellium model has encountered both conceptual ambiguity and computational difficulty, especially for low-dimensional semiconducting materials. In this Communication, we propose a physical, straightforward, and dimension-independent universal model to calculate the formation energies of charged defects in both three-dimensional (3D) bulk and low-dimensional semiconductors. Within this model, the ionized electrons or holes are placed on the realistic host band-edge states instead of the virtual jellium state, so that the model not only naturally keeps the supercell charge neutral but also has a clear physical meaning. This realistic model reproduces the same accuracy as the traditional jellium model for most of the 3D semiconducting materials, and remarkably, for the low-dimensional structures, it is able to cure the divergence caused by the artificial long-range electrostatic energy introduced in the jellium model, and hence gives meaningful formation energies of defects in charged states and transition energy levels of the corresponding defects. Our realistic method, therefore, will have a significant impact on the study of defect physics in all low-dimensional systems including quantum dots, nanowires, surfaces, interfaces, and 2D materials.

preprint · 2019 · arXiv

Temporal Self-Attention Network for Medical Concept Embedding

In longitudinal electronic health records (EHRs), the event records of a patient are distributed over a long period of time and the temporal relations between the events reflect sufficient domain knowledge to benefit prediction tasks such as the rate of inpatient mortality. Medical concept embedding is a feature extraction method that transforms a set of medical concepts with a specific time stamp into a vector, which is then fed into a supervised learning algorithm. The quality of the embedding significantly determines the learning performance over the medical data. In this paper, we propose a medical concept embedding method based on applying a self-attention mechanism to represent each medical concept. We propose a novel attention mechanism which captures the contextual information and temporal relationships between medical concepts. A light-weight neural net, "Temporal Self-Attention Network (TeSAN)", is then proposed to learn medical concept embedding based solely on the proposed attention mechanism. To test the effectiveness of our proposed methods, we have conducted clustering and prediction tasks on two public EHR datasets, comparing TeSAN against five state-of-the-art embedding methods. The experimental results demonstrate that the proposed TeSAN model is superior to all the compared methods. To the best of our knowledge, this work is the first to exploit temporal self-attentive relations between medical events.