Source author record

Lin Zheng

Lin Zheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning, Computation and Language, Computer Vision, Artificial Intelligence, astro-ph.CO, astro-ph.HE, eess.SP, Information Retrieval, math.ST, physics.comp-ph, physics.flu-dyn, Statistics Theory

Catalog footprint

What is connected

8 works

12 topics

4 close collaborators

preprint · 2026 · arXiv

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.
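As a rough illustration of the entropy trigger described in this abstract, the Python sketch below flags byte positions whose next-byte distribution has high Shannon entropy. The fixed-threshold rule and the names `entropy_bits`, `scratchpad_positions`, and `threshold_bits` are assumptions made for this example; the abstract does not state the actual trigger rule.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(next_byte_dists, threshold_bits):
    """Offsets within a patch where a scratchpad would be inserted.

    Hypothetical rule (an assumption, not the paper's): trigger whenever
    the next-byte entropy exceeds a fixed threshold.
    """
    return [
        i
        for i, dist in enumerate(next_byte_dists)
        if entropy_bits(dist) > threshold_bits
    ]


# Toy patch of four positions: two confident predictions, two uncertain ones.
confident = [1.0] + [0.0] * 255   # all mass on one byte, entropy 0 bits
uniform = [1.0 / 256] * 256       # maximally uncertain, entropy 8 bits
patch = [confident, uniform, confident, uniform]

positions = scratchpad_positions(patch, threshold_bits=4.0)
print(positions)  # [1, 3]
```

Under this assumed rule, raising `threshold_bits` yields fewer scratchpads and lowering it yields more, which is one way an inference-time compute budget could be adjusted without retraining.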

preprint · 2022 · arXiv

Linear Complexity Randomized Self-attention Mechanism

Recently, random feature attentions (RFAs) have been proposed to approximate softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the bias in such approximation by recasting RFAs as self-normalized importance samplers. This perspective further sheds light on an unbiased estimator for the whole softmax attention, called randomized attention (RA). RA constructs positive random features via query-specific distributions and enjoys greatly improved approximation fidelity, albeit exhibiting quadratic complexity. By combining the expressiveness of RA and the efficiency of RFA, we develop a novel linear complexity self-attention mechanism called linear randomized attention (LARA). Extensive experiments across various domains demonstrate that RA and LARA improve the performance of RFAs by a substantial margin.
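To make "linearizing the exponential kernel" concrete, here is a minimal Python sketch of generic random feature attention with positive features. It relies on the standard identity E[exp(w·q − |q|²/2) · exp(w·k − |k|²/2)] = exp(q·k) for w drawn from a standard Gaussian. It illustrates plain RFA only, not RA or LARA, and the function names and parameters are assumptions made for this example.

```python
import math
import random


def softmax_attention(q, keys, values):
    """Exact softmax attention output for a single query."""
    scores = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    total = sum(scores)
    dim = len(values[0])
    return [sum(s * v[j] for s, v in zip(scores, values)) / total
            for j in range(dim)]


def positive_features(x, omegas):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); E[phi(q) . phi(k)] = exp(q . k)."""
    half_sq = 0.5 * sum(a * a for a in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [scale * math.exp(sum(a * b for a, b in zip(w, x)) - half_sq)
            for w in omegas]


def rfa_attention(queries, keys, values, num_features=256, seed=0):
    """Approximate softmax attention in time linear in the number of keys."""
    rng = random.Random(seed)
    d, dim = len(keys[0]), len(values[0])
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)]
              for _ in range(num_features)]
    # Key/value summaries are built once and shared by every query.
    summary = [[0.0] * dim for _ in range(num_features)]
    normalizer = [0.0] * num_features
    for k, v in zip(keys, values):
        for i, p in enumerate(positive_features(k, omegas)):
            normalizer[i] += p
            for j in range(dim):
                summary[i][j] += p * v[j]
    outputs = []
    for q in queries:
        phi_q = positive_features(q, omegas)
        denom = sum(p * z for p, z in zip(phi_q, normalizer))
        outputs.append([
            sum(p * summary[i][j] for i, p in enumerate(phi_q)) / denom
            for j in range(dim)
        ])
    return outputs
```

Each output is a ratio of two kernel estimates that share the same random features. This is the self-normalized structure the abstract refers to, and a ratio of unbiased estimates is in general not itself unbiased.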

preprint · 2022 · arXiv

Poincaré Heterogeneous Graph Neural Networks for Sequential Recommendation

Sequential recommendation (SR) learns users' preferences by capturing the sequential patterns in the evolution of users' behaviors. As discussed in many works, user-item interactions in SR generally follow an intrinsic power-law distribution, which can be ascribed to hierarchy-like structures. Previous methods usually handle such hierarchical information by making user-item sectionalization empirically under Euclidean space, which may cause distortion of user-item representation in real online scenarios. In this paper, we propose a Poincaré-based heterogeneous graph neural network named PHGR to model the sequential pattern information as well as hierarchical information contained in the data of SR scenarios simultaneously. Specifically, for the purpose of explicitly capturing the hierarchical information, we first construct a weighted user-item heterogeneous graph by aligning all the user-item interactions to improve the perception domain of each user from a global view. Then the output of the global representation is used to complement the local directed item-item homogeneous graph convolution. By defining a novel hyperbolic inner product operator, the global and local graph representation learning are conducted directly in the Poincaré ball instead of through the commonly used projection operation between the Poincaré ball and Euclidean space, which could alleviate the cumulative error issue of the general bidirectional translation process. Moreover, for the purpose of explicitly capturing the sequential dependency information, we design two types of temporal attention operations in the Poincaré ball space. Empirical evaluations on public and financial-industry datasets show that PHGR outperforms several comparison methods.

preprint · 2022 · arXiv

Ripple Attention for Visual Perception with Sub-quadratic Complexity

Transformer architectures are now central to sequence modeling tasks. At their heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied in the computer vision domain, where 2D images are first segmented into patches and then treated as 1D sequences. Such linearization, however, impairs the notion of spatial locality in images, which bears important visual cues. To bridge the gap, we propose ripple attention, a sub-quadratic attention mechanism for vision transformers. Built upon the recent kernel-based efficient attention mechanisms, we design a novel dynamic programming algorithm that weights contributions of different tokens to a query with respect to their relative spatial distances in the 2D space in linear observed time. Extensive experiments and analyses demonstrate the effectiveness of ripple attention on various visual tasks.

preprint · 2020 · arXiv

Generative Semantic Hashing Enhanced via Boltzmann Machines

Generative semantic hashing is a promising technique for large-scale information retrieval thanks to its fast retrieval speed and small memory footprint. For the tractability of training, existing generative-hashing methods mostly assume a factorized form for the posterior distribution, enforcing independence among the bits of hash codes. From the perspectives of both model representation and code space size, independence is not always the best assumption. In this paper, to introduce correlations among the bits of hash codes, we propose to employ the distribution of a Boltzmann machine as the variational posterior. To address the intractability issue of training, we first develop an approximate method to reparameterize the distribution of a Boltzmann machine by augmenting it as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Based on that, an asymptotically-exact lower bound is further derived for the evidence lower bound (ELBO). With these novel techniques, the entire model can be optimized efficiently. Extensive experimental results demonstrate that by effectively modeling correlations among different bits within a hash code, our model can achieve significant performance gains.

preprint · 2020 · arXiv

Template Matching and Change Point Detection by M-estimation

We consider the fundamental problem of matching a template to a signal. We do so by M-estimation, which encompasses procedures that are robust to gross errors (i.e., outliers). Using standard results from empirical process theory, we derive the convergence rate and the asymptotic distribution of the M-estimator under relatively mild assumptions. We also discuss the optimality of the estimator, both in finite samples in the minimax sense and in the large-sample limit in terms of local minimaxity and relative efficiency. Although most of the paper is dedicated to the study of the basic shift model in the context of a random design, we consider many extensions towards the end of the paper, including more flexible templates, fixed designs, the agnostic setting, and more.
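A small Python sketch may help picture the shift model: it estimates a template's location by minimizing a robust loss over a grid of candidate shifts. The Gaussian-bump template, the Huber loss, the evenly spaced design points, the grid search, and all names below are assumptions chosen for illustration; the abstract does not prescribe a particular loss or optimizer.

```python
import math


def template(x, width=0.05):
    """Hypothetical template: a Gaussian bump centred at zero."""
    return math.exp(-0.5 * (x / width) ** 2)


def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def m_estimate_shift(xs, ys, candidates, loss=huber):
    """Candidate shift minimising sum_i loss(y_i - template(x_i - theta))."""
    return min(
        candidates,
        key=lambda theta: sum(
            loss(y - template(x - theta)) for x, y in zip(xs, ys)
        ),
    )


# Synthetic signal: the template shifted to 0.4, plus four gross outliers.
true_shift = 0.4
xs = [i / 199 for i in range(200)]
ys = [template(x - true_shift) for x in xs]
for i in (5, 20, 180, 195):
    ys[i] += 10.0  # gross errors far from the bump

candidates = [i / 100 for i in range(101)]
shift_hat = m_estimate_shift(xs, ys, candidates)
print(shift_hat)  # 0.4
```

Because the Huber loss grows only linearly for large residuals, each outlier contributes a bounded pull on the estimate, which is the kind of robustness to gross errors the abstract mentions.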

preprint · 2014 · arXiv

Continuous surface force based lattice Boltzmann equation method for simulating thermocapillary flow

In this paper, we extend a lattice Boltzmann equation (LBE) with continuous surface force (CSF) to simulate thermocapillary flows. The model builds on our previous CSF LBE for athermal two-phase flow, in which the interfacial tension forces and the Marangoni stresses that result from the interface interactions between different phases are described by the concept of CSF. In this model, the sharp interfaces between different phases are replaced by narrow transition layers, and the kinetics and morphology evolution of phase separation are characterized by an order parameter via a Cahn-Hilliard equation, which is solved in the framework of LBE. The scalar convection-diffusion equation for the temperature field is also solved by a thermal LBE. The models are validated by thermal two-layered Poiseuille flow and by two superimposed planar fluids at negligibly small Reynolds and Marangoni numbers for thermocapillary-driven convection, both of which have analytical solutions for the velocity and temperature. Then the thermocapillary migration of a two-dimensional deformable droplet is simulated. Numerical results show that the predictions of the present LBE agree with the analytical solutions and other numerical results.

preprint · 2012 · arXiv

Multiple periodic oscillations in the radio light curves of NRAO 530

In this paper, the time series analysis method CLEANest is employed to search for characteristic periodicities in the radio light curves of the blazar NRAO 530 at 4.8, 8.0 and 14.5 GHz over a time baseline of three decades. Two prominent periodicities on time scales of about 6.3 and 9.5 yr are identified at all three frequencies, in agreement with previous results derived from different numerical techniques, confirming the multiplicity of the periodicities in NRAO 530. In addition to these two significant periods, there is also evidence of shorter-timescale periodicities of about 5.0 yr, 4.2 yr, 3.4 yr and 2.8 yr showing lower amplitude in the periodograms. The physical mechanisms responsible for the radio quasi-periodic oscillations and the multiplicity of the periods are discussed.
