Source author record

Ye Lin

Ye Lin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Artificial Intelligence; astro-ph.GA; Biomolecules; Computation and Language; cond-mat.mtrl-sci; Distributed, Parallel, and Cluster Computing; Hardware Architecture; Information Retrieval; math.OC

Catalog footprint

What is connected

8 works

10 topics

4 close collaborators

preprint · 2026 · arXiv

A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously towards edge low-batch scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve 2.41x on average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
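As a rough illustration of what load-balancing-aware placement of expert computations can look like, the sketch below uses a generic greedy heuristic (heaviest expert first, onto the currently least-loaded unit). This is not the scheduling algorithm from the paper, which the abstract does not specify; the function name `greedy_balance`, the expert labels, and the per-expert load numbers are all hypothetical.

```python
import heapq

def greedy_balance(expert_loads, num_units):
    """Assign experts to units, heaviest first, always picking the least-loaded unit.

    expert_loads: dict mapping a (hypothetical) expert id to its estimated work,
    e.g. the number of tokens routed to that expert.
    """
    # Min-heap of (current total load, unit index).
    heap = [(0, unit) for unit in range(num_units)]
    heapq.heapify(heap)
    assignment = {unit: [] for unit in range(num_units)}

    by_load = sorted(expert_loads.items(), key=lambda kv: kv[1], reverse=True)
    for expert, load in by_load:
        total, unit = heapq.heappop(heap)
        assignment[unit].append(expert)
        heapq.heappush(heap, (total + load, unit))

    totals = {
        unit: sum(expert_loads[e] for e in experts)
        for unit, experts in assignment.items()
    }
    return assignment, totals

# Made-up token counts per expert, showing non-uniform expert selection.
loads = {"e0": 9, "e1": 7, "e2": 6, "e3": 5, "e4": 4, "e5": 1}
assignment, totals = greedy_balance(loads, num_units=2)
print(assignment)
print(totals)
```

A heuristic like this only balances whole experts; the tensor-parallel partitioning described in the abstract would additionally split a single large expert across several units.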

preprint · 2026 · arXiv

CD-PIM: A High-Bandwidth and Compute-Efficient LPDDR5-Based PIM for Low-Batch LLM Acceleration on Edge-Device

Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through three key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) to improve component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt a column-wise mapping for the key-cache matrix and row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that compared to a GPU-only baseline and state-of-the-art PIM designs, our CD-PIM achieves 11.42x and 4.25x speedup on average within a single batch in HBCEM mode, respectively. Moreover, for low-batch sizes, the CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.

preprint · 2020 · arXiv

CSRN: Collaborative Sequential Recommendation Networks for News Retrieval

Nowadays, news apps have taken over the popularity of paper-based media, providing a great opportunity for personalization. Recurrent Neural Network (RNN)-based sequential recommendation is a popular approach that utilizes users' recent browsing history to predict future items. This approach is limited in that it does not consider the societal influences of news consumption, i.e., users may follow popular topics that are constantly changing, while certain hot topics might be spreading only among specific groups of people. Such societal impact is difficult to predict given only users' own reading histories. On the other hand, the traditional User-based Collaborative Filtering (UserCF) makes recommendations based on the interests of the "neighbors", which provides the possibility to supplement the weaknesses of RNN-based methods. However, conventional UserCF only uses a single similarity metric to model the relationships between users, which is too coarse-grained and thus limits the performance. In this paper, we propose a framework of deep neural networks to integrate the RNN-based sequential recommendations and the key ideas from UserCF, to develop Collaborative Sequential Recommendation Networks (CSRNs). Firstly, we build a directed co-reading network of users, to capture the fine-grained topic-specific similarities between users in a vector space. Then, the CSRN model encodes users with RNNs, and learns to attend to neighbors and summarize what news they are reading at the moment. Finally, news articles are recommended according to both the user's own state and the summarized state of the neighbors. Experiments on two public datasets show that the proposed model outperforms the state-of-the-art approaches significantly.

preprint · 2020 · arXiv

General-Purpose User Embeddings based on Mobile App Usage

In this paper, we report our recent practice at Tencent for user modeling based on mobile app usage. User behaviors on mobile app usage, including retention, installation, and uninstallation, can be a good indicator for both long-term and short-term interests of users. For example, if a user has recently installed Snapseed, she might have a growing interest in photography. Such information is valuable for numerous downstream applications, including advertising, recommendations, etc. Traditionally, user modeling from mobile app usage heavily relies on handcrafted feature engineering, which requires onerous human work for different downstream applications, and could be sub-optimal without domain experts. However, automatic user modeling based on mobile app usage faces unique challenges, including (1) retention, installation, and uninstallation are heterogeneous but need to be modeled collectively, (2) user behaviors are distributed unevenly over time, and (3) many long-tailed apps suffer from serious sparsity. In this paper, we present a tailored AutoEncoder-coupled Transformer Network (AETN), by which we overcome these challenges and achieve the goals of reducing manual efforts and boosting performance. We have deployed the model at Tencent, and both online/offline experiments from multiple domains of downstream applications have demonstrated the effectiveness of the output user embeddings.

preprint · 2020 · arXiv

Simultaneous Localization and Parameter Estimation for Single Particle Tracking via Sigma Points based EM

Single Particle Tracking (SPT) is a powerful class of tools for analyzing the dynamics of individual biological macromolecules moving inside living cells. The acquired data is typically in the form of a sequence of camera images that are then post-processed to reveal details about the motion. In this work, we develop an algorithm for jointly estimating both particle trajectory and motion model parameters from the data. Our approach uses Expectation Maximization (EM) combined with an Unscented Kalman filter (UKF) and an Unscented Rauch-Tung-Striebel smoother (URTSS), allowing us to use an accurate, nonlinear model of the observations acquired by the camera. Due to the shot noise characteristics of the photon generation process, this model uses a Poisson distribution to capture the measurement noise inherent in imaging. In order to apply a UKF, we first must transform the measurements into a model with additive Gaussian noise. We consider two approaches, one based on variance stabilizing transformations (where we compare the Anscombe and Freeman-Tukey transforms) and one on a Gaussian approximation to the Poisson distribution. Through simulations, we demonstrate efficacy of the approach and explore the differences among these measurement transformations.
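The variance stabilizing transformations named in the abstract have standard textbook forms: the Anscombe transform is 2·sqrt(x + 3/8) and the Freeman-Tukey transform is sqrt(x) + sqrt(x + 1). The sketch below, which is only an illustration and not the authors' UKF/EM pipeline, computes the variance of each transformed Poisson variable by summing over the probability mass function. The helper names and the chosen rate of 20 are assumptions made for the example.

```python
import math

def anscombe(x):
    """Anscombe transform: 2 * sqrt(x + 3/8)."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

def freeman_tukey(x):
    """Freeman-Tukey transform: sqrt(x) + sqrt(x + 1)."""
    return math.sqrt(x) + math.sqrt(x + 1.0)

def poisson_pmf(k, lam):
    """Poisson probability of count k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def variance_under_poisson(f, lam, k_max=400):
    """Variance of f(X) for X ~ Poisson(lam), truncating the sum at k_max."""
    mean = sum(f(k) * poisson_pmf(k, lam) for k in range(k_max))
    second = sum(f(k) ** 2 * poisson_pmf(k, lam) for k in range(k_max))
    return second - mean ** 2

lam = 20.0  # hypothetical photon rate for one pixel
var_raw = variance_under_poisson(float, lam)
var_anscombe = variance_under_poisson(anscombe, lam)
var_ft = variance_under_poisson(freeman_tukey, lam)
print(var_raw, var_anscombe, var_ft)
```

The raw counts have variance equal to the rate, while both transformed variables have variance close to 1 at this rate, which is what makes an additive Gaussian noise model a reasonable approximation after the transform.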

preprint · 2020 · arXiv

Towards Fully 8-bit Integer Inference for the Transformer Model

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in Transformer), and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification on the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm Scale Propagation could be derived. De-quantization is adopted when necessary, which makes the network more efficient. Our experiments on WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.
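To make the quantization and de-quantization steps concrete, here is a minimal sketch of generic symmetric per-tensor 8-bit quantization. It is not the Scale Propagation algorithm or the Integer Transformer from the paper, whose details the abstract does not give; the function names and the sample weights are hypothetical. It also shows where a roughly 4x storage reduction comes from: 4 bytes per 32-bit float versus 1 byte per 8-bit integer.

```python
from array import array

def quantize_int8(values):
    """Map floats to int8 in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array(
        "b", (max(-127, min(127, round(v / scale))) for v in values)
    )
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [scale * q for q in quantized]

# Hypothetical weights stored as 32-bit floats.
weights = array("f", [0.5, -1.27, 0.03, 1.0])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

float_bytes = weights.itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
print(q.tolist(), scale)
print(float_bytes, int8_bytes)
```

With round-to-nearest, each restored value differs from the original by at most half the scale, which is the precision traded for the smaller footprint.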

preprint · 2016 · arXiv

Defect-engineered graphene for bulk supercapacitors with high energy and power densities

The development of high-energy and high-power density supercapacitors (SCs) is critical for enabling next-generation energy storage applications. Nanocarbons are excellent SC electrode materials due to their economic viability, high-surface area, and high stability. Although nanocarbons have high theoretical surface area and hence high double layer capacitance, the net amount of energy stored in nanocarbon-SCs is much below theoretical limits due to two inherent bottlenecks: i) their low quantum capacitance and ii) limited ion-accessible surface area. Here, we demonstrate that defects in graphene could be effectively used to mitigate these bottlenecks by drastically increasing the quantum capacitance and opening new channels to facilitate ion diffusion in otherwise closed interlayer spaces. Our results support the emergence of a new energy paradigm in SCs with 250% enhancement in double layer capacitance beyond the theoretical limit. Furthermore, we demonstrate prototype defect engineered bulk SC devices with energy densities 500% higher than state-of-the-art commercial SCs without compromising the power density.

preprint · 2014 · arXiv

The environment of barred galaxies in the low-redshift Universe

We present a study of the environment of barred galaxies using a volume-limited sample of over 30,000 galaxies drawn from the Sloan Digital Sky Survey. We use four different statistics to quantify the environment: the projected two-point cross-correlation function, the background-subtracted number count of neighbor galaxies, the overdensity of the local environment, and the membership of our galaxies to galaxy groups to segregate central and satellite systems. For barred galaxies as a whole, we find a very weak difference in all the quantities compared to unbarred galaxies of the control sample. When we split our sample into early- and late-type galaxies, we see a weak but significant trend for early-type galaxies with a bar to be more strongly clustered on scales from a few 100 kpc to 1 Mpc when compared to unbarred early-type galaxies. This indicates that the presence of a bar in early-type galaxies depends on the location within their host dark matter halos. This is confirmed by the group catalog in the sense that for early-types, the fraction of central galaxies is smaller if they have a bar. For late-type galaxies, we find fewer neighbors within $\sim$50 kpc around the barred galaxies when compared to unbarred galaxies from the control sample, suggesting that tidal forces from close companions suppress the formation/growth of bars. Finally, we find no obvious correlation between overdensity and the bars in our sample, showing that galactic bars are not obviously linked to the large-scale structure of the universe.
