Researcher profile

Shaojie Tang

Shaojie Tang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
16works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

16 published item(s)

preprint2026arXiv

From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning

Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with global prototypes. These approaches are essentially coordinate alignment, where representations of clients are forced to match the global prototypes in the embedding space in an element-wise manner. Such alignment implicitly assumes that all clients should map their representations into the feature subspace defined by the global prototypes. This assumption is reasonable in homogeneous FL, where all clients share the same feature extractor. However, it becomes problematic in HtFL, since heterogeneous feature extractors naturally induce client-specific feature subspaces, and forcing all clients to optimize within a single global subspace unnecessarily suppresses their learning capacity. We observe that coordinate alignment implicitly couples two distinct objectives: aligning inter-class semantic structure, which is directly beneficial for classification, and enforcing a shared feature basis, which is unnecessary and even harmful under model heterogeneity. Building on this insight, we design FedSAF, which shifts the alignment objective from absolute coordinates to inter-class relational structure. We demonstrate that structural alignment consistently outperforms coordinate alignment in heterogeneous settings. Experiments on multiple benchmarks show that our structural alignment outperforms state-of-the-art prototype-based HtFL methods by up to 3.52\%.

preprint2022arXiv

Improved Parallel Algorithm for Minimum Cost Submodular Cover Problem

In the minimum cost submodular cover problem (MinSMC), we are given a monotone nondecreasing submodular function $f\colon 2^V \rightarrow \mathbb{Z}^+$, a linear cost function $c: V\rightarrow \mathbb R^{+}$, and an integer $k\leq f(V)$, the goal is to find a subset $A\subseteq V$ with the minimum cost such that $f(A)\geq k$. The MinSMC can be found at the heart of many machine learning and data mining applications. In this paper, we design a parallel algorithm for the MinSMC that takes at most $O(\frac{\log km\log k(\log m+\log\log mk)}{\varepsilon^4})$ adaptive rounds, and it achieves an approximation ratio of $\frac{H(\min\{Δ,k\})}{1-5\varepsilon}$ with probability at least $1-3\varepsilon$, where $Δ=\max_{v\in V}f(v)$, $H(\cdot)$ is the Harmonic number, $m=|V|$, and $\varepsilon$ is a constant in $(0,\frac{1}{5})$.

preprint2022arXiv

On-Device Learning with Cloud-Coordinated Data Augmentation for Extreme Model Personalization in Recommender Systems

Data heterogeneity is an intrinsic property of recommender systems, making models trained over the global data on the cloud, which is the mainstream in industry, non-optimal to each individual user's local data distribution. To deal with data heterogeneity, model personalization with on-device learning is a potential solution. However, on-device training using a user's small size of local samples will incur severe overfitting and undermine the model's generalization ability. In this work, we propose a new device-cloud collaborative learning framework, called CoDA, to break the dilemmas of purely cloud-based learning and on-device learning. The key principle of CoDA is to retrieve similar samples from the cloud's global pool to augment each user's local dataset to train the recommendation model. Specifically, after a coarse-grained sample matching on the cloud, a personalized sample classifier is further trained on each device for a fine-grained sample filtering, which can learn the boundary between the local data distribution and the outside data distribution. We also build an end-to-end pipeline to support the flows of data, model, computation, and control between the cloud and each device. We have deployed CoDA in a recommendation scenario of Mobile Taobao. Online A/B testing results show the remarkable performance improvement of CoDA over both cloud-based learning without model personalization and on-device training without data augmentation. Overhead testing on a real device demonstrates the computation, storage, and communication efficiency of the on-device tasks in CoDA.

preprint2022arXiv

Optimal Sampling Gaps for Adaptive Submodular Maximization

Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive, one common trick to reduce the size of a data set, and thus reduce the computational cost of machine learning algorithms, is \emph{probability sampling}. It creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution when using the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability at least $r\in[0,1]$. If we set $r=1$, our problem reduces to finding a solution based on the original full data set. We define sampling gap as the largest ratio between the optimal solution obtained from the full data set and the optimal solution obtained from the samples, over independence systems. Our main contribution is to show that if the sampling probability of each data point is at least $r$ and the utility function is policywise submodular, then the sampling gap is both upper bounded and lower bounded by $1/r$. We show that the property of policywise submodular can be found in a wide range of real-world applications, including pool-based active learning and adaptive viral marketing.

preprint2022arXiv

Robust Adaptive Submodular Maximization

The goal of a sequential decision making problem is to design an interactive policy that adaptively selects a group of items, each selection is based on the feedback from the past, in order to maximize the expected utility of selected items. It has been shown that the utility functions of many real-world applications are adaptive submodular. However, most of existing studies on adaptive submodular optimization focus on the average-case. Unfortunately, a policy that has a good average-case performance may have very poor performance under the worst-case realization. In this study, we propose to study two variants of adaptive submodular optimization problems, namely, worst-case adaptive submodular maximization and robust submodular maximization. The first problem aims to find a policy that maximizes the worst-case utility and the latter one aims to find a policy, if any, that achieves both near optimal average-case utility and worst-case utility simultaneously. We introduce a new class of stochastic functions, called \emph{worst-case submodular function}. For the worst-case adaptive submodular maximization problem subject to a $p$-system constraint, we develop an adaptive worst-case greedy policy that achieves a $\frac{1}{p+1}$ approximation ratio against the optimal worst-case utility if the utility function is worst-case submodular. For the robust adaptive submodular maximization problem subject to cardinality constraints (resp. partition matroid constraints), if the utility function is both worst-case submodular and adaptive submodular, we develop a hybrid adaptive policy that achieves an approximation close to $1-e^{-\frac{1}{2}}$ (resp. $1/3$) under both worst- and average-case settings simultaneously. We also describe several applications of our theoretical results, including pool-base active learning, stochastic submodular set cover and adaptive viral marketing.

preprint2022arXiv

Streaming Adaptive Submodular Maximization

Many sequential decision making problems can be formulated as an adaptive submodular maximization problem. However, most of existing studies in this field focus on pool-based setting, where one can pick items in any order, and there have been few studies for the stream-based setting where items arrive in an arbitrary order and one must immediately decide whether to select an item or not upon its arrival. In this paper, we introduce a new class of utility functions, semi-policywise submodular functions. We develop a series of effective algorithms to maximize a semi-policywise submodular function under the stream-based setting.

preprint2022arXiv

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment, while facilitating daily task iteration. Specifically, the compute container is based on Mobile Neural Network (MNN), a tensor compute engine along with the data processing and model execution libraries, which are exposed through a refined Python thread-level virtual machine (VM) to support diverse ML tasks and concurrent task execution. The core of MNN is the novel mechanisms of operator decomposition and semi-auto search, sharply reducing the workload in manually optimizing hundreds of operators for tens of hardware backends and further quickly identifying the best backend with runtime optimization for a computation graph. The data pipeline introduces an on-device stream processing framework to enable processing user behavior data at source. The deployment platform releases ML tasks with an efficient push-then-pull method and supports multi-granularity deployment policies. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability. Extensive micro-benchmarks also highlight the superior performance of MNN and the Python thread-level VM. Walle has been in large-scale production use in Alibaba, while MNN has been open source with a broad impact in the community.

preprint2021arXiv

A Deep Learning-Based Approach to Extracting Periosteal and Endosteal Contours of Proximal Femur in Quantitative CT Images

Automatic CT segmentation of proximal femur is crucial for the diagnosis and risk stratification of orthopedic diseases; however, current methods for the femur CT segmentation mainly rely on manual interactive segmentation, which is time-consuming and has limitations in both accuracy and reproducibility. In this study, we proposed an approach based on deep learning for the automatic extraction of the periosteal and endosteal contours of proximal femur in order to differentiate cortical and trabecular bone compartments. A three-dimensional (3D) end-to-end fully convolutional neural network, which can better combine the information between neighbor slices and get more accurate segmentation results, was developed for our segmentation task. 100 subjects aged from 50 to 87 years with 24,399 slices of proximal femur CT images were enrolled in this study. The separation of cortical and trabecular bone derived from the QCT software MIAF-Femur was used as the segmentation reference. We randomly divided the whole dataset into a training set with 85 subjects for 10-fold cross-validation and a test set with 15 subjects for evaluating the performance of models. Two models with the same network structures were trained and they achieved a dice similarity coefficient (DSC) of 97.87% and 96.49% for the periosteal and endosteal contours, respectively. To verify the excellent performance of our model for femoral segmentation, we measured the volume of different parts of the femur and compared it with the ground truth and the relative errors between predicted result and ground truth are all less than 5%. It demonstrated a strong potential for clinical use, including the hip fracture risk prediction and finite element analysis.

preprint2021arXiv

A Survey on Incorporating Domain Knowledge into Deep Learning for Medical Image Analysis

Although deep learning models like CNNs have achieved great success in medical image analysis, the small size of medical datasets remains a major bottleneck in this area. To address this problem, researchers have started looking for external information beyond current available medical datasets. Traditional approaches generally leverage the information from natural images via transfer learning. More recent works utilize the domain knowledge from medical doctors, to create networks that resemble how medical doctors are trained, mimic their diagnostic patterns, or focus on the features or areas they pay particular attention to. In this survey, we summarize the current progress on integrating medical domain knowledge into deep learning models for various tasks, such as disease diagnosis, lesion, organ and abnormality detection, lesion and organ segmentation. For each task, we systematically categorize different kinds of medical domain knowledge that have been utilized and their corresponding integrating methods. We also provide current challenges and directions for future research.

preprint2021arXiv

Adaptive Cascade Submodular Maximization

In this paper, we propose and study the cascade submodular maximization problem under the adaptive setting. The input of our problem is a set of items, each item is in a particular state (i.e., the marginal contribution of an item) which is drawn from a known probability distribution. However, we can not know its actual state before selecting it. As compared with existing studies on stochastic submodular maximization, one unique setting of our problem is that each item is associated with a continuation probability which represents the probability that one is allowed to continue to select the next item after selecting the current one. Intuitively, this term captures the externality of selecting one item to all its subsequent items in terms of the opportunity of being selected. Therefore, the actual set of items that can be selected by a policy depends on the specific ordering it adopts to select items, this makes our problem fundamentally different from classical submodular set optimization problems. Our objective is to identify the best sequence of selecting items so as to maximize the expected utility of the selected items. We propose a class of stochastic utility functions, \emph{adaptive cascade submodular functions}, and show that the objective functions in many practical application domains satisfy adaptive cascade submodularity. Then we develop a $0.12$ approximation algorithm to the adaptive cascade submodular maximization problem.

preprint2021arXiv

Adaptive Regularized Submodular Maximization

In this paper, we study the problem of maximizing the difference between an adaptive submodular (revenue) function and an non-negative modular (cost) function under the adaptive setting. The input of our problem is a set of $n$ items, where each item has a particular state drawn from some known prior distribution $p$. The revenue function $g$ is defined over items and states, and the cost function $c$ is defined over items, i.e., each item has a fixed cost. The state of each item is unknown initially, one must select an item in order to observe its realized state. A policy $π$ specifies which item to pick next based on the observations made so far. Denote by $g_{avg}(π)$ the expected revenue of $π$ and let $c_{avg}(π)$ denote the expected cost of $π$. Our objective is to identify the best policy $π^o\in \arg\max_πg_{avg}(π)-c_{avg}(π)$ under a $k$-cardinality constraint. Since our objective function can take on both negative and positive values, the existing results of submodular maximization may not be applicable. To overcome this challenge, we develop a series of effective solutions with performance grantees. Let $π^o$ denote the optimal policy. For the case when $g$ is adaptive monotone and adaptive submodular, we develop an effective policy $π^l$ such that $g_{avg}(π^l) - c_{avg}(π^l) \geq (1-\frac{1}{e}-ε)g_{avg}(π^o) - c_{avg}(π^o)$, using only $O(nε^{-2}\log ε^{-1})$ value oracle queries. For the case when $g$ is adaptive submodular, we present a randomized policy $π^r$ such that $g_{avg}(π^r) - c_{avg}(π^r) \geq \frac{1}{e}g_{avg}(π^o) - c_{avg}(π^o)$.

preprint2020arXiv

A Novel Method for ECG Signal Classification via One-Dimensional Convolutional Neural Network

This paper presents an end-to-end ECG signal classification method based on a novel segmentation strategy via 1D Convolutional Neural Networks (CNN) to aid the classification of ECG signals. The ECG segmentation strategy named R-R-R strategy (i.e., retaining ECG data between the R peaks just before and after the current R peak) for segmenting the original ECG data into segments in order to train and test the 1D CNN models. The novel strategy mimics physicians in scanning ECG to a greater extent, and maximizes the inherent information of ECG segments. The performance of the classification models for 5-class and 6-class are verified with ECG signals from 48 records of the MIT-BIH arrhythmia database. As the heartbeat types are divided into 5 classes (i.e., normal beat, left bundle branch block beat, right bundle branch block beat, ventricular ectopic beat, and paced beat) in the MIT-BIH, the best classification accuracy, the area under the curve (AUC), the sensitivity and the F1-score reach 99.24%, 0.9994, 0.99 and 0.99, respectively. As the heartbeat types are divided into 6 classes (i.e., normal beat, left bundle branch block beat, right bundle branch block beat, ventricular ectopic beat, paced beat and other beats) in the MIT-BIH, the beat classification accuracy, the AUC, the sensitivity, and the F1-score reach 97.02%, 0.9966, 0.97, and 0.97, respectively. Meanwhile, according to the recommended practice from the Association for Advancement of Medical Instrumentation (AAMI), the heartbeat types are divided into 5 classes (i.e., normal beat, supraventricular ectopic beats, ventricular ectopic beats, fusion beats, and unclassifiable beats), the beat classification accuracy, the sensitivity, and the F1-score reach 97.45%, 0.97, and 0.97, respectively. The experimental results show that the proposed method achieves better performance than the state-of-the-art methods.

preprint2020arXiv

Assortment Optimization with Repeated Exposures and Product-dependent Patience Cost

In this paper, we study the assortment optimization problem faced by many online retailers such as Amazon. We develop a \emph{cascade multinomial logit model}, based on the classic multinomial logit model, to capture the consumers' purchasing behavior across multiple stages. Different from existing studies, our model allows for repeated exposures of a product, i.e., the same product can be displayed multiple times across different stages. In addition, each consumer has a \emph{patience budget} that is sampled from a known distribution and each product is associated with a \emph{patience cost}, which captures the cognitive efforts spent on browsing that product. Given an assortment of products, a consumer sequentially browses them stage by stage. After browsing all products in one stage, if the utility of a product exceeds the utility of the outside option, the consumer proceeds to purchase the product and leave the platform. Otherwise, if the patience cost of all products browsed up to that point is no larger than her patience budget, she continues to view the next stage. We propose an approximation solution to this problem.

preprint2020arXiv

Distributed Optimization over Block-Cyclic Data

We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure: each cycle contains several blocks, and each client's training data follow block-specific and non-i.i.d. distributions. Such a data structure would introduce client and block biases during the collaborative training: the single global model would be biased towards the client or block specific data. To overcome the biases, we propose two new distributed optimization algorithms called multi-model parallel SGD (MM-PSGD) and multi-chain parallel SGD (MC-PSGD) with a convergence rate of $O(1/\sqrt{NT})$, achieving a linear speedup with respect to the total number of clients. In particular, MM-PSGD adopts the block-mixed training strategy, while MC-PSGD further adds the block-separate training strategy. Both algorithms create a specific predictor for each block by averaging and comparing the historical global models generated in this block from different cycles. We extensively evaluate our algorithms over the CIFAR-10 dataset. Evaluation results demonstrate that our algorithms significantly outperform the conventional federated averaging algorithm in terms of test accuracy, and also preserve robustness for the variance of critical parameters.

preprint2020arXiv

Linear-Time Algorithms for Adaptive Submodular Maximization

In this paper, we develop fast algorithms for two stochastic submodular maximization problems. We start with the well-studied adaptive submodular maximization problem subject to a cardinality constraint. We develop the first linear-time algorithm which achieves a $(1-1/e-ε)$ approximation ratio. Notably, the time complexity of our algorithm is $O(n\log\frac{1}ε)$ (number of function evaluations) which is independent of the cardinality constraint, where $n$ is the size of the ground set. Then we introduce the concept of fully adaptive submodularity, and develop a linear-time algorithm for maximizing a fully adaptive submoudular function subject to a partition matroid constraint. We show that our algorithm achieves a $\frac{1-1/e-ε}{4-2/e-2ε}$ approximation ratio using only $O(n\log\frac{1}ε)$ number of function evaluations.

preprint2020arXiv

SelectScale: Mining More Patterns from Images via Selective and Soft Dropout

Convolutional neural networks (CNNs) have achieved remarkable success in image recognition. Although the internal patterns of the input images are effectively learned by the CNNs, these patterns only constitute a small proportion of useful patterns contained in the input images. This can be attributed to the fact that the CNNs will stop learning if the learned patterns are enough to make a correct classification. Network regularization methods like dropout and SpatialDropout can ease this problem. During training, they randomly drop the features. These dropout methods, in essence, change the patterns learned by the networks, and in turn, forces the networks to learn other patterns to make the correct classification. However, the above methods have an important drawback. Randomly dropping features is generally inefficient and can introduce unnecessary noise. To tackle this problem, we propose SelectScale. Instead of randomly dropping units, SelectScale selects the important features in networks and adjusts them during training. Using SelectScale, we improve the performance of CNNs on CIFAR and ImageNet.