Researcher profile

Yan Gu

Yan Gu contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
17 works
0 followers
12 topics
4 close collaborators

Published work

17 published items

preprint | 2026 | arXiv

Parallel Dynamic Spatial Indexes

Maintaining spatial data (points in two or three dimensions) is crucial and has a wide range of applications, such as graphics, GIS, and robotics. To handle spatial data, many data structures, called spatial indexes, have been proposed, e.g. kd-trees, oct/quadtrees (also called Orth-trees), R-trees, and bounding volume hierarchies (BVHs). In real-world applications, spatial datasets tend to be highly dynamic, requiring batch updates of points with low latency. This calls for efficient parallel batch updates on spatial indexes. Unfortunately, there is very little work that achieves this. In this paper, we systematically study parallel spatial indexes, with a special focus on achieving high update performance for highly dynamic workloads. We select two types of spatial indexes that are considered optimized for low-latency updates: Orth-tree and R-tree/BVH. We propose two data structures: the P-Orth tree, a parallel Orth-tree, and the SPaC-tree family, a parallel R-tree/BVH. Both the P-Orth tree and the SPaC-tree deliver superior performance in batch updates compared to existing parallel kd-trees and Orth-trees, while preserving better or competitive query performance relative to their corresponding Orth-tree and R-tree counterparts. We also present comprehensive experiments comparing the performance of various parallel spatial indexes and share our findings at the end of the paper.

preprint | 2026 | arXiv

Provably Fast and Space-Efficient Parallel Biconnectivity

Biconnectivity is one of the most fundamental graph problems. The canonical parallel biconnectivity algorithm is the Tarjan-Vishkin algorithm, which has $O(n+m)$ optimal work (number of operations) and polylogarithmic span (longest chain of dependent operations) on a graph with $n$ vertices and $m$ edges. However, Tarjan-Vishkin is not widely used in practice. We believe the reason is the space-inefficiency (it generates an auxiliary graph with $O(m)$ edges). In practice, existing parallel implementations are based on breadth-first search (BFS). Since BFS has span proportional to the diameter of the graph, existing parallel BCC implementations suffer from poor performance on large-diameter graphs and can be even slower than the sequential algorithm on many real-world graphs. We propose the first parallel biconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm first generates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. All the steps in our algorithm are highly parallel. We carefully analyze the correctness of our algorithm, which is highly non-trivial. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We ran them on a 96-core machine on 27 graphs, including social, web, road, $k$-NN, and synthetic graphs, with significantly varying sizes and edge distributions. FAST-BCC is the fastest on all 27 graphs. On average (geometric means), FAST-BCC is 5.1$\times$ faster than GBBS, and 3.1$\times$ faster than the best existing baseline on each graph.
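For orientation, the sequential baseline named in the abstract can be sketched as follows. This is a minimal illustration of a DFS-based, Hopcroft-Tarjan style computation of biconnected components, not FAST-BCC and not the authors' code. The function name `biconnected_components` is hypothetical, the graph is assumed to be simple (no parallel edges or self-loops), and recursion is used for brevity, so it only suits small inputs.

```python
from collections import defaultdict

def biconnected_components(n, edges):
    """Sequential DFS-based biconnected components (illustrative sketch).

    Assumes a simple undirected graph on vertices 0..n-1.
    Returns a list of components, each given as a set of vertices.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = [-1] * n   # DFS discovery time, -1 means unvisited
    low = [0] * n     # lowest discovery time reachable from the subtree
    timer = 0
    stack = []        # edges of the component currently being explored
    comps = []

    def dfs(u, parent):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        for v in adj[u]:
            if disc[v] == -1:                  # tree edge
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:          # u separates v's subtree
                    comp = set()
                    while True:
                        a, b = stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    comps.append(comp)
            elif v != parent and disc[v] < disc[u]:   # back edge
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if disc[s] == -1:
            dfs(s, -1)
    return comps

# A triangle 0-1-2 with a pendant edge 2-3 has two biconnected components.
print(biconnected_components(4, [(0, 1), (1, 2), (2, 0), (2, 3)]))
```

The DFS order makes every step depend on the previous one, which is why the parallel algorithms discussed in the abstract take a different route.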

preprint | 2024 | arXiv

Spectral integrated neural networks (SINNs) for solving forward and inverse dynamic problems

This paper proposes a novel neural network framework, denoted as spectral integrated neural networks (SINNs), for resolving three-dimensional forward and inverse dynamic problems. In the SINNs, the spectral integration method is applied to perform temporal discretization, and then a fully connected neural network is adopted to solve resulting partial differential equations (PDEs) in the spatial domain. Specifically, spatial coordinates are employed as inputs in the network architecture, and the output layer is configured with multiple outputs, each dedicated to approximating solutions at different time instances characterized by Gaussian points used in the spectral method. By leveraging the automatic differentiation technique and spectral integration scheme, the SINNs minimize the loss function, constructed based on the governing PDEs and boundary conditions, to obtain solutions for dynamic problems. Additionally, we utilize polynomial basis functions to expand the unknown function, aiming to enhance the performance of SINNs in addressing inverse problems. The conceived framework is tested on six forward and inverse dynamic problems, involving nonlinear PDEs. Numerical results demonstrate the superior performance of SINNs over the popularly used physics-informed neural networks in terms of convergence speed, computational accuracy and efficiency. It is also noteworthy that the SINNs exhibit the capability to deliver accurate and stable solutions for long-time dynamic problems.

preprint | 2023 | arXiv

Real-Time Walking Pattern Generation of Quadrupedal Dynamic-Surface Locomotion based on a Linear Time-Varying Pendulum Model

This study introduces an analytically tractable and computationally efficient model of the legged robot dynamics associated with locomotion on a dynamic rigid surface (DRS), and develops a real-time motion planner based on the proposed model and its analytical solution. This study first theoretically extends the classical linear inverted pendulum (LIP) model from legged locomotion on a static surface to DRS locomotion, by relaxing the LIP's underlying assumption that the surface is static. The resulting model, which we call "DRS-LIP", is explicitly time-varying. After converting the DRS-LIP into Mathieu's equation, an approximate analytical solution of the DRS-LIP is obtained, which is reasonably accurate with a low computational cost. Furthermore, to illustrate the practical uses of the analytical results, they are exploited to develop a hierarchical motion planner that efficiently generates physically feasible trajectories for DRS locomotion. Finally, the effectiveness of the proposed theoretical results and motion planner is demonstrated both through PyBullet simulations and experimentally on a Laikago quadrupedal robot that walks on a rocking treadmill. The videos of simulations and hardware experiments are available at https://youtu.be/u2Q_u2pR99c.
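To make the step from a time-varying pendulum model to Mathieu's equation concrete, here is a worked sketch under simplifying assumptions that are not taken from the paper: the surface moves only vertically as $z_s(t) = A\sin(\omega t)$, the center of mass stays at a constant height $h$ above the surface, and the support point is at the origin. The symbols $A$, $\omega$, $h$, $a$ and $q$ are introduced here for illustration; the paper's own model and notation may differ.

```latex
% Classical LIP on a static surface, CoM at height h above the support point:
\ddot{x} = \frac{g}{h}\, x

% Assumed DRS variant: vertical surface motion z_s(t) = A \sin(\omega t)
% adds the surface acceleration to gravity, giving a time-varying coefficient:
\ddot{x} = \frac{g + \ddot{z}_s(t)}{h}\, x
         = \frac{g - A\omega^{2}\sin(\omega t)}{h}\, x

% Rescale time with 2\tau = \omega t + \pi/2, so that
% \sin(\omega t) = -\cos(2\tau) and d^2/dt^2 = (\omega^2/4)\, d^2/d\tau^2.
% The result is Mathieu's equation in canonical form:
\frac{d^{2}x}{d\tau^{2}} + \left(a - 2q\cos 2\tau\right) x = 0,
\qquad a = -\frac{4g}{h\,\omega^{2}}, \quad q = \frac{2A}{h}
```

With $A = 0$ the parameter $q$ vanishes and the equation reduces to the classical LIP in rescaled time.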

preprint | 2022 | arXiv

$PT$-symmetric non-Hermitian Hamiltonian and invariant operator in periodically driven $SU(1,1)$ system

We study in this paper the time evolution of a $PT$-symmetric non-Hermitian Hamiltonian consisting of periodically driven $SU(1,1)$ generators. A non-Hermitian invariant operator is adopted to solve the Schrödinger equation, since the time-dependent Hamiltonian is no longer a conserved quantity. We propose a scheme to construct the non-Hermitian invariant with a $PT$-symmetric but non-unitary transformation operator. The eigenstates of the invariant and its complex conjugate form a bi-orthogonal basis to formulate the exact solution. We obtain the non-adiabatic Berry phase, which reduces to the adiabatic one in the slow time-variation limit. A non-unitary time-evolution operator is found analytically. As a consequence of the non-unitarity, the ket ($|ψ(t)\rangle$) and bra ($\langle ψ(t)|$) states are not normalized to each other. However, the inner product of two states can be evaluated with the help of a metric operator. It is shown explicitly that the model can be realized by a periodically driven oscillator.

preprint | 2022 | arXiv

DRS-LIP: Linear Inverted Pendulum Model for Legged Locomotion on Dynamic Rigid Surfaces

Legged robot locomotion on a dynamic rigid surface (i.e., a rigid surface moving in the inertial frame) involves complex full-order dynamics that is high-dimensional, nonlinear, and time-varying. Towards deriving an analytically tractable dynamic model, this study theoretically extends the reduced-order linear inverted pendulum (LIP) model from legged locomotion on a stationary surface to locomotion on a dynamic rigid surface (DRS). The resulting model is herein termed as DRS-LIP. Furthermore, this study introduces an approximate analytical solution of the proposed DRS-LIP that is computationally efficient with high accuracy. To illustrate the practical uses of the analytical results, they are used to develop a hierarchical planning framework that efficiently generates physically feasible trajectories for DRS locomotion. The effectiveness of the proposed theoretical results and motion planner is demonstrated both through simulations and experimentally on a Laikago quadrupedal robot that walks on a rocking treadmill.

preprint | 2022 | arXiv

Invariant Filtering for Legged Humanoid Locomotion on Dynamic Rigid Surfaces

State estimation for legged locomotion over a dynamic rigid surface (DRS), which is a rigid surface moving in the world frame (e.g., ships, aircraft, and trains), remains an under-explored problem. This paper introduces an invariant extended Kalman filter that estimates the robot's pose and velocity during DRS locomotion by using common sensors of legged robots (e.g., inertial measurement units (IMU), joint encoders, and RGB-D camera). A key feature of the filter is that it explicitly addresses the nonstationary surface-foot contact point and the hybrid robot behaviors. Another key feature is that, in the absence of IMU biases, the filter satisfies the attractive group affine and invariant observation conditions, and is thus provably convergent for the deterministic continuous phases. The observability analysis is performed to reveal the effects of DRS movement on the state observability, and the convergence property of the hybrid, deterministic filter system is examined for the observable state variables. Experiments of a Digit humanoid robot walking on a pitching treadmill validate the effectiveness of the proposed filter under large estimation errors and moderate DRS movement. The video of the experiments can be found at: https://youtu.be/ScQIBFUSKzo.

preprint | 2022 | arXiv

Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient

To design efficient parallel algorithms, some recent papers showed that many sequential iterative algorithms can be directly parallelized but there are still challenges in achieving work-efficiency and high-parallelism. Work-efficiency can be hard for certain problems where the number of dependences is asymptotically more than the optimal sequential work bound. To achieve high-parallelism, we want to process as many objects as possible in parallel. The goal is to achieve $\tilde{O}(D)$ span for a problem with the deepest dependence length $D$. We refer to this property as round-efficiency. In this paper, we show work-efficient and round-efficient algorithms for a variety of classic problems and propose general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes them accordingly. All objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms aim to use range queries to extract all objects with the same rank, such that we avoid evaluating all the dependences. We discuss activity selection, unlimited knapsack, and more using the Type 1 framework. Type 2 algorithms aim to wake up an object when the last object it depends on is finished. We discuss activity selection, longest increasing subsequence (LIS), and many other algorithms using the Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient. Many of them improve previous best bounds, and some of them are the first to achieve work-efficiency with round-efficiency. We also implement many of them. On inputs with reasonable dependence depth, our algorithms are highly parallelized and significantly outperform their sequential counterparts.
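The idea of assigning ranks and processing equal-rank objects together can be illustrated on LIS. The sketch below is only an illustration of the rank structure, not the paper's parallel algorithm: it computes each element's rank sequentially (the length of the longest strictly increasing subsequence ending at that element) and then groups indices by rank. The function names `lis_ranks` and `phases` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def lis_ranks(a):
    """Rank of a[i] = length of the longest strictly increasing
    subsequence that ends at a[i] (computed sequentially here)."""
    tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
    ranks = []
    for x in a:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
        ranks.append(k + 1)
    return ranks

def phases(a):
    """Group indices by rank. Elements in one group do not depend on each
    other, so a phase-parallel scheme could handle each group in one round."""
    groups = defaultdict(list)
    for i, r in enumerate(lis_ranks(a)):
        groups[r].append(i)
    return [groups[r] for r in sorted(groups)]

a = [3, 1, 4, 1, 5, 9, 2, 6]
print(lis_ranks(a))
print(phases(a))
```

The number of groups equals the LIS length, which plays the role of the dependence depth $D$ in the abstract.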

preprint | 2022 | arXiv

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections

Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1--7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3--2.6x less space than the state-of-the-art graph processing system Aspen.

preprint | 2022 | arXiv

ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8--110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75--54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
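As background for the nearest-neighbor chain algorithm that ParChain parallelizes, here is a minimal sequential sketch for complete linkage. It is not the ParChain implementation: it keeps a full quadratic distance table (exactly the memory cost the paper avoids), uses Euclidean distance, and the function name `nn_chain_complete_linkage` and the merge-record format are assumptions made for illustration.

```python
import math

def nn_chain_complete_linkage(points):
    """Sequential nearest-neighbor chain HAC with complete linkage.

    Returns a list of merges (cluster_a, cluster_b, height); merged clusters
    get new ids n, n+1, ... in merge order.
    """
    n = len(points)
    # d[a][b] = current linkage distance between active clusters a and b
    d = {i: {j: math.dist(points[i], points[j]) for j in range(n) if j != i}
         for i in range(n)}
    merges, chain, next_id = [], [], n

    while len(d) > 1:
        if not chain:
            chain.append(min(d))            # start a new chain anywhere
        while True:
            a = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            # nearest neighbor of a; on ties prefer the previous chain element
            b = min(d[a], key=lambda c: (d[a][c], c != prev, c))
            if b == prev:                   # reciprocal nearest neighbors
                break
            chain.append(b)
        b = chain.pop()
        a = chain.pop()
        height = d[a][b]
        # complete linkage: distance to the merged cluster is the maximum
        new = {c: max(d[a][c], d[b][c]) for c in d if c not in (a, b)}
        del d[a], d[b]
        for c, dist_c in new.items():
            del d[c][a], d[c][b]
            d[c][next_id] = dist_c
        d[next_id] = new
        merges.append((a, b, height))
        next_id += 1
    return merges

print(nn_chain_complete_linkage([(0,), (1,), (5,), (6,), (20,)]))
```

The rest of the chain stays valid after a merge because complete linkage is reducible; the same holds for the average and Ward's linkage criteria mentioned in the abstract.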

preprint | 2022 | arXiv

ParGeo: A Library for Parallel Computational Geometry

This paper presents ParGeo, a multicore library for computational geometry. ParGeo contains modules for fundamental tasks including $k$d-tree based spatial search, spatial graph generation, and algorithms in computational geometry. We focus on three new algorithmic contributions provided in the library. First, we present a new parallel convex hull algorithm based on a reservation technique to enable parallel modifications to the hull. We also provide the first parallel implementations of the randomized incremental convex hull algorithm as well as a divide-and-conquer convex hull algorithm in $\mathbb{R}^3$. Second, for the smallest enclosing ball problem, we propose a new sampling-based algorithm to quickly reduce the size of the data set. We also provide the first parallel implementation of Welzl's classic algorithm for smallest enclosing ball. Third, we present the BDL-tree, a parallel batch-dynamic $k$d-tree that allows for efficient parallel updates and $k$-NN queries over dynamically changing point sets. BDL-trees consist of a log-structured set of $k$d-trees which can be used to efficiently insert, delete, and query batches of points in parallel. On 36 cores with two-way hyper-threading, our fastest convex hull algorithm achieves up to 44.7x self-relative parallel speedup and up to 559x speedup against the best existing sequential implementation. Our smallest enclosing ball algorithm using our sampling-based algorithm achieves up to 27.1x self-relative parallel speedup and up to 178x speedup against the best existing sequential implementation. Our implementation of the BDL-tree achieves self-relative parallel speedup of up to 46.1x. Across all of the algorithms in ParGeo, we achieve self-relative parallel speedup of 8.1--46.61x.

preprint | 2021 | arXiv

Generalized Bell-like inequality and maximum violation for multiparticle entangled Schrödinger-cat-states of spin-s

This paper proposes a generalized Bell-like inequality (GBI) for multiparticle entangled Schrödinger-cat states of arbitrary spin-$s$. Based on quantum probability statistics, the GBI and its violation are formulated in a unified manner with the help of the state density operator, which can be separated into local and non-local parts. The local part gives rise to the inequality, while the non-local part is responsible for the violation. The GBI is not violated at all by the quantum average except for the spin-$1/2$ entangled states. If the measuring outcomes are restricted to the subspace of spin coherent states (SCS), namely, only the maximum spin values $\pm s$, the GBI is still meaningful for the incomplete measurement. With the help of SCS quantum probability statistics, it is proved that the violation of the GBI can occur only for half-integer spins but not integer spins. Moreover, the maximum violation bound depends on the number parity of the entangled particles: it is $1/2$ for odd particle numbers and $1$ for even numbers.

preprint | 2021 | arXiv

Parallel In-Place Algorithms: Theory and Practice

Many parallel algorithms use at least linear auxiliary space in the size of the input to enable computations to be done independently without conflicts. Unfortunately, this extra space can be prohibitive for memory-limited machines, preventing large inputs from being processed. Therefore, it is desirable to design parallel in-place algorithms that use sublinear (or even polylogarithmic) auxiliary space. In this paper, we bridge the gap between theory and practice for parallel in-place (PIP) algorithms. We first define two computational models based on fork-join parallelism, which reflect modern parallel programming environments. We then introduce a variety of new parallel in-place algorithms that are simple and efficient, both in theory and in practice. Our algorithmic highlight is the Decomposable Property introduced in this paper, which enables existing non-in-place but highly-optimized parallel algorithms to be converted into parallel in-place algorithms. Using this property, we obtain algorithms for random permutation, list contraction, tree contraction, and merging that take linear work, $O(n^{1-ε})$ auxiliary space, and $O(n^ε\cdot\text{polylog}(n))$ span for $0<ε<1$. We also present new parallel in-place algorithms for scan, filter, merge, connectivity, biconnectivity, and minimum spanning forest using other techniques. In addition to theoretical results, we present experimental results for implementations of many of our parallel in-place algorithms. We show that on a 72-core machine with two-way hyper-threading, the parallel in-place algorithms usually outperform existing parallel algorithms for the same problems that use linear auxiliary space, indicating that the theory developed in this paper indeed leads to practical benefits in terms of both space usage and running time.

preprint | 2021 | arXiv

Theoretically-Efficient and Practical Parallel DBSCAN

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms.
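For readers unfamiliar with DBSCAN itself, the following is a minimal sequential sketch of the exact method with brute-force neighbor search. It takes quadratic work, which is the cost the abstract's algorithms are designed to avoid, and it is not the paper's algorithm. The function name `dbscan`, the parameter names `eps` and `min_pts`, and the label convention (-1 for noise) are choices made for this illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Brute-force exact DBSCAN (quadratic work, illustrative only).

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps. Returns one label per point: a cluster id
    starting at 0, or -1 for noise.
    """
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n          # None = not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1       # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        frontier = list(nb)
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:      # only core points expand the cluster
                frontier.extend(nbj)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))
```

Replacing the brute-force `neighbors` scan with a spatial index is the usual first step toward the sub-quadratic bounds the abstract refers to.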

preprint | 2020 | arXiv

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an $Ω(\log n)$ overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.
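The work and span measures described above can be made concrete with a small cost-accounting sketch. The code below simulates, sequentially, a divide-and-conquer sum in which each task forks exactly two children, and it counts work (total operations) and span (longest dependence chain) with one unit charged per leaf and per combine step. The function name `fork_join_sum` and the unit-cost accounting are assumptions for illustration, not definitions from the paper.

```python
def fork_join_sum(a, lo=0, hi=None):
    """Simulate a binary-forking reduction over a[lo:hi].

    Returns (total, work, span). Requires a non-empty range.
    Work adds up over both children; span takes the slower child,
    because the two children could run in parallel.
    """
    if hi is None:
        hi = len(a)
    if hi - lo == 1:
        return a[lo], 1, 1
    mid = (lo + hi) // 2
    left_sum, left_work, left_span = fork_join_sum(a, lo, mid)      # first child
    right_sum, right_work, right_span = fork_join_sum(a, mid, hi)   # second child
    return (left_sum + right_sum,
            left_work + right_work + 1,          # +1 for the combine step
            max(left_span, right_span) + 1)

total, work, span = fork_join_sum(list(range(8)))
print(total, work, span)
```

For $n$ inputs the work grows linearly while the span grows as $\log_2 n + 1$, which shows why reaching $n$ parallel tasks through binary forks alone already costs logarithmic span.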

preprint | 2020 | arXiv

Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

Non-volatile main memory (NVRAM) technologies provide an attractive set of features for large-scale graph analytics, including byte-addressability, low idle power, and improved memory-density. NVRAM systems today have an order of magnitude more NVRAM than traditional memory (DRAM). NVRAM systems could therefore potentially allow very large graph problems to be solved on a single machine, at a modest cost. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be much more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics using the Parallel Semi-Asymmetric Model (PSAM), in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the PSAM approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. To experimentally study this new setting, we develop Sage, a parallel semi-asymmetric graph engine with which we implement provably-efficient (and often work-optimal) PSAM algorithms for over a dozen fundamental graph problems. We experimentally study Sage using a 48-core machine on the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) equipped with Optane DC Persistent Memory, and show that Sage outperforms the fastest prior systems designed for NVRAM. Importantly, we also show that Sage nearly matches the fastest prior systems running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.

preprint | 2019 | arXiv

Measuring outcome correlation for spin-s Bell cat-state and geometric phase induced spin parity effect

In terms of quantum probability statistics, the Bell inequality (BI) and its violation are extended to the spin-$s$ entangled Schrödinger cat-state (called the Bell cat-state) with both parallel and antiparallel spin-polarizations. The BI is never violated for the measuring outcome probabilities evaluated over the entire two-spin Hilbert space except for the spin-$1/2$ entangled states. A universal Bell-type inequality (UBI) denoted by $p_{s}^{lc}\leq0$ is formulated with the local realistic model under the condition that the measuring outcomes are restricted to the subspace of spin coherent states. A spin parity effect is observed: the UBI can be violated only by the Bell cat-states of half-integer but not integer spins. The violation of the UBI is seen to be a direct result of the non-trivial Berry phase between the spin coherent states of south- and north-pole gauges for half-integer spin, while the geometric phase is trivial for integer spins. A maximum violation bound of the UBI is found as $p_{s}^{\max}=1$, which is valid for arbitrary half-integer spin-$s$ states.