Source author record

Yihan Sun

Yihan Sun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Distributed, Parallel, and Cluster Computing Molecular Networks Databases Machine Learning

Catalog footprint

What is connected

11works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Parallel Dynamic Spatial Indexes

Maintaining spatial data (points in two or three dimensions) is crucial and has a wide range of applications, such as graphics, GIS, and robotics. To handle spatial data, many data structures, called spatial indexes, have been proposed, e.g. kd-trees, oct/quadtrees (also called Orth-trees), R-trees, and bounding volume hierarchies (BVHs). In real-world applications, spatial datasets tend to be highly dynamic, requiring batch updates of points with low latency. This calls for efficient parallel batch updates on spatial indexes. Unfortunately, there is very little work that achieves this. In this paper, we systematically study parallel spatial indexes, with a special focus on achieving high-performance update performance for highly dynamic workloads. We select two types of spatial indexes that are considered optimized for low-latency updates: Orth-tree and R-tree/BVH. We propose two data structures: the P-Orth tree, a parallel Orth-tree, and the SPaC-tree family, a parallel R-tree/BVH. Both the P-Orth tree and the SPaC-tree deliver superior performance in batch updates compared to existing parallel kd-trees and Orth-trees, while preserving better or competitive query performance relative to their corresponding Orth-tree and R-tree counterparts. We also present comprehensive experiments comparing the performance of various parallel spatial indexes and share our findings at the end of the paper.

preprint2026arXiv

Provably Fast and Space-Efficient Parallel Biconnectivity

Biconnectivity is one of the most fundamental graph problems. The canonical parallel biconnectivity algorithm is the Tarjan-Vishkin algorithm, which has $O(n+m)$ optimal work (number of operations) and polylogarithmic span (longest dependent operations) on a graph with $n$ vertices and $m$ edges. However, Tarjan-Vishkin is not widely used in practice. We believe the reason is the space-inefficiency (it generates an auxiliary graph with $O(m)$ edges). In practice, existing parallel implementations are based on breath-first search (BFS). Since BFS has span proportional to the diameter of the graph, existing parallel BCC implementations suffer from poor performance on large-diameter graphs and can be even slower than the sequential algorithm on many real-world graphs. We propose the first parallel biconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm first generates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. All the steps in our algorithm are highly-parallel. We carefully analyze the correctness of our algorithm, which is highly non-trivial. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We ran them on a 96-core machine on 27 graphs, including social, web, road, $k$-NN, and synthetic graphs, with significantly varying sizes and edge distributions. FAST-BCC is the fastest on all 27 graphs. On average (geometric means), FAST-BCC is 5.1$\times$ faster than GBBS, and 3.1$\times$ faster than the best existing baseline on each graph.

preprint2026arXiv

STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning

Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. As graph data increasingly contain multimodal node attributes such as text and images, multimodal federated graph learning (MM-FGL) has become an important yet substantially harder setting. The key challenge is that clients from different modality domains may not share a common semantic space: even for the same concept, their local encoders can produce inconsistent representations before collaboration begins. This makes direct parameter coordination unreliable and further causes two downstream problems: forcing heterogeneous client representations into a naively shared semantic space may create false semantic agreement, and graph message passing may amplify residual inconsistency across neighborhoods. To address this issue, we propose \textbf{STAGE}, a protocol-first framework for MM-FGL. Instead of relying on direct parameter averaging, STAGE builds a shared semantic space that first translates heterogeneous multimodal features into comparable representations and then regulates how these representations propagate over local graph structures. In this way, STAGE not only improves cross-client semantic calibration, but also reduces the risk of inconsistency amplification during graph learning. Extensive experiments on 8 multimodal-attributed graphs across 5 graph-centric and modality-centric tasks show that STAGE consistently achieves state-of-the-art performance while reducing per-round communication payload.

preprint2022arXiv

Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient

To design efficient parallel algorithms, some recent papers showed that many sequential iterative algorithms can be directly parallelized but there are still challenges in achieving work-efficiency and high-parallelism. Work-efficiency can be hard for certain problems where the number of dependences is asymptotically more than optimal sequential work bound. To achieve high-parallelism, we want to process as many objects as possible in parallel. The goal is to achieve $\tilde{O}(D)$ span for a problem with the deepest dependence length $D$. We refer to this property as round-efficiency. In this paper, we show work-efficient and round-efficient algorithms for a variety of classic problems and propose general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes them accordingly. All objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms aim to use range queries to extract all objects with the same rank, such that we avoid evaluating all the dependences. We discuss activity selection, unlimited knapsack, and more using Type 1 framework. Type 2 algorithms aim to wake up an object when the last object it depends on is finished. We discuss activity selection, longest increasing subsequence (LIS), and many other algorithms using Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient. Many of them improve previous best bounds, and some of them are the first to achieve work-efficiency with round-efficiency. We also implement many of them. On inputs with reasonable dependence depth, our algorithms are highly parallelized and significantly outperform their sequential counterparts.

preprint2022arXiv

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections

Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1--7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3--2.6x less space than the state-of-the-art graph processing system Aspen.

preprint2020arXiv

Constant-Time Snapshots with Applications to Concurrent Data Structures

We present an approach for efficiently taking snapshots of the state of a collection of CAS objects. Taking a snapshot allows later operations to read the value that each CAS object had at the time the snapshot was taken. Taking a snapshot requires a constant number of steps and returns a handle to the snapshot. Reading a snapshotted value of an individual CAS object using this handle is wait-free, taking time proportional to the number of successful CASes on the object since the snapshot was taken. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on concurrent data structures built from CAS objects. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to two search trees, one balanced and one not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries.

preprint2020arXiv

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an $Ω(\log n)$ overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.

preprint2016arXiv

Parallel Ordered Sets Using Join

The ordered set is one of the most important data type in both theoretical algorithm design and analysis and practical programming. In this paper we study the set operations on two ordered sets, including Union, Intersect and Difference, based on four types of balanced Binary Search Trees (BST) including AVL trees, red-black trees, weight balanced trees and treaps. We introduced only one subroutine Join that needs to be implemented differently for each balanced BST, and on top of which we can implement generic, simple and efficient parallel functions for ordered sets. We first prove the work-efficiency of these Join-based set functions using a generic proof working for all the four types of balanced BSTs. We also implemented and tested our algorithm on all the four balancing schemes. Interestingly the implementations on all four data structures and three set functions perform similarly in time and speedup (more than 45x on 64 cores). We also compare the performance of our implementation to other existing libraries and algorithms.

preprint2016arXiv

Parallel Shortest-Paths Using Radius Stepping

The single-source shortest path problem (SSSP) with nonnegative edge weights is a notoriously difficult problem to solve efficiently in parallel---it is one of the graph problems said to suffer from the transitive-closure bottleneck. In practice, the $Δ$-stepping algorithm of Meyer and Sanders (J. Algorithms, 2003) often works efficiently but has no known theoretical bounds on general graphs. The algorithm takes a sequence of steps, each increasing the radius by a user-specified value $Δ$. Each step settles the vertices in its annulus but can take $Θ(n)$ substeps, each requiring $Θ(m)$ work ($n$ vertices and $m$ edges). In this paper, we describe Radius-Stepping, an algorithm with the best-known tradeoff between work and depth bounds for SSSP with nearly-linear ($\otilde(m)$) work. The algorithm is a $Δ$-stepping-like algorithm but uses a variable instead of fixed-size increase in radii, allowing us to prove a bound on the number of steps. In particular, by using what we define as a vertex $k$-radius, each step takes at most $k+2$ substeps. Furthermore, we define a $(k, ρ)$-graph property and show that if an undirected graph has this property, then the number of steps can be bounded by $O(\frac{n}ρ \log ρL)$, for a total of $O(\frac{kn}ρ \log ρL)$ substeps, each parallel. We describe how to preprocess a graph to have this property. Altogether, Radius-Stepping takes $O((m+n\log n)\log \frac{n}ρ)$ work and $O(\frac{n}ρ\log n \log (ρL))$ depth per source after preprocessing. The preprocessing step can be done in $O(m\log n + nρ^2)$ work and $O(ρ^2)$ depth or in $O(m\log n + nρ^2\log n)$ work and $O(ρ\log ρ)$ depth, and adds no more than $O(nρ)$ edges.

preprint2014arXiv

Fair Evaluation of Global Network Aligners

Biological network alignment identifies topologically and functionally conserved regions between networks of different species. It encompasses two algorithmic steps: node cost function (NCF), which measures similarities between nodes in different networks, and alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs. Thus, it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We already showed on MI-GRAAL and IsoRankN that combining NCF of one method and AS of another method can lead to a new superior method. Here, we evaluate MI-GRAAL against newer GHOST to potentially further improve alignment quality. Also, we approach several important questions that have not been asked systematically thus far. First, we ask how much of the node similarity information in NCF should come from sequence data compared to topology data. Existing methods determine this more-less arbitrarily, which could affect the resulting alignment(s). Second, when topology is used in NCF, we ask how large the size of the neighborhoods of the compared nodes should be. Existing methods assume that larger neighborhood sizes are better. We find that MI-GRAAL's NCF is superior to GHOST's NCF, while the performance of the methods' ASs is data-dependent. Thus, the combination of MI-GRAAL's NCF and GHOST's AS could be a new superior method for certain data. Also, which amount of sequence information is used within NCF does not affect alignment quality, while the inclusion of topological information is crucial. Finally, larger neighborhood sizes are preferred, but often, it is the second largest size that is superior, and using this size would decrease computational complexity. Together, our results give several general recommendations for a fair evaluation of network alignment methods.

preprint2014arXiv

Simultaneous Optimization of Both Node and Edge Conservation in Network Alignment via WAVE

Network alignment can be used to transfer functional knowledge between conserved regions of different networks. Typically, existing methods use a node cost function (NCF) to compute similarity between nodes in different networks and an alignment strategy (AS) to find high-scoring alignments with respect to the total NCF over all aligned nodes (or node conservation). But, they then evaluate quality of their alignments via some other measure that is different than the node conservation measure used to guide the alignment construction process. Typically, one measures the amount of conserved edges, but only after alignments are produced. Hence, a recent attempt aimed to directly maximize the amount of conserved edges while constructing alignments, which improved alignment accuracy. Here, we aim to directly maximize both node and edge conservation during alignment construction to further improve alignment accuracy. For this, we design a novel measure of edge conservation that (unlike existing measures that treat each conserved edge the same) weighs each conserved edge so that edges with highly NCF-similar end nodes are favored. As a result, we introduce a novel AS, Weighted Alignment VotEr (WAVE), which can optimize any measures of node and edge conservation, and which can be used with any NCF or combination of multiple NCFs. Using WAVE on top of established state-of-the-art NCFs leads to superior alignments compared to the existing methods that optimize only node conservation or only edge conservation or that treat each conserved edge the same. And while we evaluate WAVE in the computational biology domain, it is easily applicable in any domain.

Yihan Sun

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Parallel Dynamic Spatial Indexes

Provably Fast and Space-Efficient Parallel Biconnectivity

STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning

Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections

Constant-Time Snapshots with Applications to Concurrent Data Structures

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

Parallel Ordered Sets Using Join

Parallel Shortest-Paths Using Radius Stepping

Fair Evaluation of Global Network Aligners

Simultaneous Optimization of Both Node and Edge Conservation in Network Alignment via WAVE