Researcher profile

Yuanhao Wei

Yuanhao Wei contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
10 works
0 followers
3 topics
4 close collaborators

Published work

10 published items

preprint | 2026 | arXiv

Trace Validation of Unmodified Concurrent Systems with OmniLink

Concurrent systems are notoriously difficult to validate: subtle bugs may only manifest under rare thread interleavings, and existing tools often require intrusive instrumentation or unrealistic execution models. We present OmniLink, a new methodology for validating concurrent implementations against high-level specifications in TLA+. Unlike prior TLA+ based approaches which use a technique called trace validation, OmniLink treats system events as black boxes with a timebox in which they occurred and a meaning in TLA+, solving for a logical total order of actions. Unlike prior approaches based on linearizability checking, which already solves for total orders of actions with timeboxes, OmniLink uses a flexible specification language, and offers a different linearizability checking method based on off-the-shelf model checking. OmniLink offers different features compared to existing linearizability checking tools, and we show that it outperforms the state of the art on large-scale validation tasks. Our evaluation validates WiredTiger, a state-of-the-art industrial database storage layer, as well as Balanced Augmented Tree (BAT), a state-of-the-art lock-free data structure from the research community, and ConcurrentQueue, a popular lock-free queue featuring aggressive performance optimizations. We use OmniLink to improve WiredTiger's existing TLA+ model, as well as develop new TLA+ models that closely match the behavior of the modeled systems, including non-linearizable behaviors. OmniLink is able to find known bugs injected into the systems under test, as well as help discover two previously unknown bugs (1 in BAT, 1 in ConcurrentQueue), which we have confirmed with the authors of those systems.
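To make "solving for a logical total order of actions with timeboxes" concrete, here is a minimal sketch. It is not OmniLink's method: the abstract says OmniLink uses TLA+ specifications and off-the-shelf model checking, whereas this sketch uses a brute-force search over permutations against a hard-coded FIFO queue specification. The names `Event`, `find_total_order` and the two sample histories are hypothetical.

```python
from itertools import permutations
from typing import NamedTuple, Optional

class Event(NamedTuple):
    name: str             # hypothetical label for the observed operation
    start: int            # timebox start
    end: int              # timebox end
    op: str               # "enq" or "deq"
    value: Optional[int]  # value enqueued, or value a deq returned (None = empty)

def respects_timeboxes(order):
    # If b's timebox ended before a's began, b may not be ordered after a.
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b.end < a.start:
                return False
    return True

def replay_queue(order):
    # Sequential FIFO queue specification: every deq must return the head.
    queue = []
    for ev in order:
        if ev.op == "enq":
            queue.append(ev.value)
        else:
            got = queue.pop(0) if queue else None
            if got != ev.value:
                return False
    return True

def find_total_order(events):
    # Brute force (exponential): try every order, keep the first valid one.
    for order in permutations(events):
        if respects_timeboxes(order) and replay_queue(order):
            return [ev.name for ev in order]
    return None

# The two enqueues overlap in time, so either may take effect first.
history_ok = [
    Event("e1", 0, 3, "enq", 1),
    Event("e2", 1, 4, "enq", 2),
    Event("d1", 5, 6, "deq", 2),
]
# Here e1 finished before e2 started, so a deq returning 2 has no explanation.
history_bad = [
    Event("e1", 0, 1, "enq", 1),
    Event("e2", 2, 3, "enq", 2),
    Event("d1", 4, 5, "deq", 2),
]

print(find_total_order(history_ok))   # ['e2', 'e1', 'd1']
print(find_total_order(history_bad))  # None
```

The search is exponential in the number of events, which is the kind of cost that a tool aimed at large-scale validation has to avoid.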

preprint | 2023 | arXiv

Practically and Theoretically Efficient Garbage Collection for Multiversioning

Multiversioning is widely used in databases, transactional memory, and concurrent data structures. It can be used to support read-only transactions that appear atomic in the presence of concurrent update operations. Any system that maintains multiple versions of each object needs a way of efficiently reclaiming them. We experimentally compare various existing reclamation techniques by applying them to a multiversion tree and a multiversion hash table. Using insights from these experiments, we develop two new multiversion garbage collection (MVGC) techniques. These techniques use two novel concurrent version list data structures. Our experimental evaluation shows that our fastest technique is competitive with the fastest existing MVGC techniques, while using significantly less space on some workloads. Our new techniques provide strong theoretical bounds, especially on space usage. These bounds ensure that the schemes have consistent performance, avoiding the very high worst-case space usage of other techniques.

preprint | 2022 | arXiv

Lock-Free Locks Revisited

This paper presents a new and practical approach to lock-free locks based on helping, which allows the user to write code using fine-grained locks, but run it in a lock-free manner. Although lock-free locks have been suggested in the past, they are widely viewed as impractical, have some key limitations, and, as far as we know, have never been implemented. The paper presents some key techniques that make lock-free locks practical and more general. The most important technique is an approach to idempotence -- i.e. making code that runs multiple times appear as if it ran once. The idea is based on using a shared log among processes running the same protected code. Importantly, the approach can be library based, requiring very little if any change to standard code -- code just needs to use the idempotent versions of memory operations (load, store, LL/SC, allocation, free). We have implemented a C++ library called Flock based on the ideas. Flock allows lock-based data structures to run in either lock-free or blocking (traditional locks) mode. We implemented a variety of tree and list-based data structures with Flock and compare the performance of the lock-free and blocking modes under a variety of workloads. The lock-free mode is almost as fast as blocking mode under almost all workloads, and significantly faster when threads are oversubscribed (more threads than processors). We also compare with several existing lock-based and lock-free alternatives.
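The idempotence idea in this abstract (code that runs several times appears to have run once, via a log shared by everyone running the same protected code) can be illustrated with a small single-threaded sketch. It is not Flock's API or implementation: `SharedLog`, `Run` and `protected_code` are hypothetical names, only loads are covered, and a plain list append stands in for the atomic commit that a concurrent version would need.

```python
class SharedLog:
    """Log shared by every run of the same protected code (hypothetical)."""
    def __init__(self):
        self.entries = []

    def commit(self, position, value):
        # The first run to reach a position records its value;
        # any later run adopts the recorded value instead of its own.
        if position == len(self.entries):
            self.entries.append(value)
        return self.entries[position]

class Run:
    """One execution of the protected code against a shared log."""
    def __init__(self, log, memory):
        self.log = log
        self.memory = memory
        self.position = 0

    def load(self, key):
        # Idempotent load: the result comes from the log, not directly from memory.
        value = self.log.commit(self.position, self.memory[key])
        self.position += 1
        return value

def protected_code(run):
    return run.load("x") + run.load("y")

memory = {"x": 1, "y": 2}
log = SharedLog()
first = protected_code(Run(log, memory))
memory["x"] = 100                          # memory changes between the two runs
second = protected_code(Run(log, memory))  # a helper re-running the same code
print(first, second)  # 3 3
```

The second run observes the same values as the first even though memory changed, which is what lets a helper finish another thread's critical section without producing a different outcome.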

preprint | 2022 | arXiv

Survey of Persistent Memory Correctness Conditions

The study of concurrent persistent programs has seen a surge of activity in recent years due to the introduction of non-volatile random access memories (NVRAM), yielding many models and correctness notions that are difficult to compare. In this paper, we survey existing correctness properties for this setting, placing them into the same context and comparing them. We present a hierarchy of these persistence properties based on the generality of the histories they deem correct, and show how this hierarchy shifts based on different model assumptions.

preprint | 2022 | arXiv

Turning Manual Concurrent Memory Reclamation into Automatic Reference Counting

Safe memory reclamation (SMR) schemes are an essential tool for lock-free data structures and concurrent programming. However, manual SMR schemes are notoriously difficult to apply correctly, and automatic schemes, such as reference counting, have been argued for over a decade to be too slow for practical purposes. A recent wave of work has disproved this long-held notion and shown that reference counting can be as scalable as hazard pointers, one of the most common manual techniques. Despite these tremendous improvements, there remains a gap of up to 2x or more in performance between these schemes and faster manual techniques such as epoch-based reclamation (EBR). In this work, we first advance these ideas and show that in many cases, automatic reference counting can in fact be as fast as the fastest manual SMR techniques. We generalize our previous Concurrent Deferred Reference Counting (CDRC) algorithm to obtain a method for converting any standard manual SMR technique into an automatic reference counting technique with a similar performance profile. Our second contribution is extending this framework to support weak pointers, which are reference-counted pointers that automatically break pointer cycles by not contributing to the reference count, thus addressing a common weakness in reference-counted garbage collection. Our experiments with a C++-library implementation show that our automatic techniques perform in line with their manual counterparts, and that our weak pointer implementation outperforms the best known atomic weak pointer library by up to an order of magnitude on high thread counts. All together, we show that the ease of use of automatic memory management can be achieved without significant cost to practical performance or general applicability.
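The role of weak pointers described in this abstract (breaking pointer cycles by not contributing to the reference count) can be seen in a small sketch. It uses Python's standard `weakref` module rather than the C++ library from the paper, and it assumes CPython, where objects are reference counted and the separate cycle collector can be switched off with `gc.disable()`. `Node`, `build_pair` and `freed_by_refcount_alone` are hypothetical names.

```python
import gc
import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.child = None
        self.parent = None

def build_pair(use_weak_parent):
    parent, child = Node("parent"), Node("child")
    parent.child = child
    # A strong back-pointer creates a cycle; a weak one does not add to the count.
    child.parent = weakref.ref(parent) if use_weak_parent else parent
    return parent

def freed_by_refcount_alone(use_weak_parent):
    gc.disable()  # assumption: CPython; only reference counts reclaim memory now
    try:
        parent = build_pair(use_weak_parent)
        probe = weakref.ref(parent)  # lets us observe whether parent was freed
        del parent                   # drop the only external strong reference
        return probe() is None
    finally:
        gc.enable()
        gc.collect()                 # clean up any leaked cycle

cycle_freed = freed_by_refcount_alone(use_weak_parent=False)
weak_freed = freed_by_refcount_alone(use_weak_parent=True)
print(cycle_freed, weak_freed)  # False True
```

With a strong back-pointer the pair keeps itself alive under pure reference counting; with a weak back-pointer it is reclaimed as soon as the last external reference goes away.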

preprint | 2020 | arXiv

Concurrent Fixed-Size Allocation and Free in Constant Time

Our goal is to efficiently solve the dynamic memory allocation problem in a concurrent setting where processes run asynchronously. On $p$ processes, we can support allocation and free for fixed-sized blocks with $O(1)$ worst-case time per operation, $\Theta(p^2)$ additive space overhead, and using only single-word read, write, and CAS. While many algorithms rely on having constant-time fixed-size allocate and free, we present the first implementation of these two operations that is constant time with reasonable space overhead.

preprint | 2020 | arXiv

Concurrent Reference Counting and Resource Management in Wait-free Constant Time

A common problem when implementing concurrent programs is efficiently protecting against unsafe races between processes reading and then using a resource (e.g., memory blocks, file descriptors, or network connections) and other processes that are concurrently overwriting and then destructing the same resource. Such read-destruct races can be protected with locks, or with lock-free solutions such as hazard pointers or read-copy-update (RCU). In this paper we describe a method for protecting read-destruct races with expected constant time overhead, $O(P^2)$ space and $O(P^2)$ delayed destructs, and with just single-word atomic memory operations (reads, writes, and CAS). It is based on an interface with four primitives, an acquire-release pair to protect accesses, and a retire-eject pair to delay the destruct until it is safe. We refer to this as the acquire-retire interface. Using the acquire-retire interface, we develop simple implementations for three common use cases: (1) memory reclamation with applications to stacks and queues, (2) reference counted objects, and (3) objects managed by ownership with moves, copies, and destructs. The first two results significantly improve on previous results, and the third application is original. Importantly, all operations have expected constant time overhead.

preprint | 2020 | arXiv

Constant-Time Snapshots with Applications to Concurrent Data Structures

We present an approach for efficiently taking snapshots of the state of a collection of CAS objects. Taking a snapshot allows later operations to read the value that each CAS object had at the time the snapshot was taken. Taking a snapshot requires a constant number of steps and returns a handle to the snapshot. Reading a snapshotted value of an individual CAS object using this handle is wait-free, taking time proportional to the number of successful CASes on the object since the snapshot was taken. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on concurrent data structures built from CAS objects. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to two search trees, one balanced and one not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries.
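The behaviour described in this abstract (a constant-step snapshot that returns a handle, and snapshot reads whose cost grows with the number of successful CASes since the snapshot) can be sketched with per-object version lists and a global timestamp. This is a single-threaded illustration under those assumptions, not the paper's algorithm: nothing here is atomic, and `Camera` and `VersionedCAS` are hypothetical names.

```python
class Camera:
    """Global timestamp source shared by a collection of objects (hypothetical)."""
    def __init__(self):
        self.timestamp = 0

    def take_snapshot(self):
        # Constant number of steps: hand out the current timestamp, then advance it.
        handle = self.timestamp
        self.timestamp += 1
        return handle

class VersionedCAS:
    """CAS object that keeps its old values in a version list, newest first."""
    def __init__(self, camera, initial):
        self.camera = camera
        self.versions = [(camera.timestamp, initial)]

    def read(self):
        return self.versions[0][1]

    def cas(self, expected, new):
        if self.versions[0][1] != expected:
            return False
        # Stamp the new version with the current global timestamp.
        self.versions.insert(0, (self.camera.timestamp, new))
        return True

    def read_snapshot(self, handle):
        # Walk past every version written after the snapshot was taken.
        for stamp, value in self.versions:
            if stamp <= handle:
                return value
        raise LookupError("object was created after the snapshot")

camera = Camera()
x = VersionedCAS(camera, 10)
x.cas(10, 11)
handle = camera.take_snapshot()
x.cas(11, 12)
x.cas(12, 13)
print(x.read(), x.read_snapshot(handle))  # 13 11
```

A multi-point query, such as a range scan over a tree whose child pointers are objects like `x`, would call `read_snapshot(handle)` wherever a sequential algorithm would read a pointer.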

preprint | 2020 | arXiv

Delay-Free Concurrency on Faulty Persistent Memory

Non-volatile memory (NVM) promises persistent main memory that remains correct despite loss of power. This has sparked a line of research into algorithms that can recover from a system crash. Since caches are expected to remain volatile, concurrent data structures and algorithms must be redesigned to guarantee that they are left in a consistent state after a system crash, and that the execution can be continued upon recovery. However, the prospect of redesigning every concurrent data structure or algorithm before it can be used in NVM architectures is daunting. In this paper, we present a construction that takes any concurrent program with reads, writes and CASs to shared memory and makes it persistent, i.e., can be continued after one or more processes fault and have to restart. Importantly the converted algorithm has constant computational delay (preserves instruction counts on each process within a constant factor), as well as constant recovery delay (a process can recover from a fault in a constant number of instructions). We show this first for a simple transformation, and then present optimizations to make it more practical, allowing for a tradeoff for better constant factors in computational delay, for sometimes increased recovery delay. We also provide an optimized transformation that works for any normalized lock-free data structure, thus allowing more efficient constructions for a large class of concurrent algorithms. We experimentally evaluate our transformations by applying them to a queue.

preprint | 2020 | arXiv

LL/SC and Atomic Copy: Constant Time, Space Efficient Implementations using only pointer-width CAS

When designing concurrent algorithms, Load-Link/Store-Conditional (LL/SC) is often the ideal primitive to have because unlike Compare and Swap (CAS), LL/SC is immune to the ABA problem. However, the full semantics of LL/SC are not supported by any modern machine, so there has been a significant amount of work on simulations of LL/SC using Compare and Swap (CAS), a synchronization primitive that enjoys widespread hardware support. All of the algorithms so far that are constant time either use unbounded sequence numbers (and thus base objects of unbounded size), or require $\Omega(MP)$ space for $M$ LL/SC objects (where $P$ is the number of processes). We present a constant time implementation of $M$ LL/SC objects using $\Theta(M+kP^2)$ space, where $k$ is the maximum number of overlapping LL/SC operations per process (usually a constant), and requiring only pointer-sized CAS objects. Our implementation can also be used to implement $L$-word $LL/SC$ objects in $\Theta(L)$ time (for both $LL$ and $SC$) and $\Theta((M+kP^2)L)$ space. To achieve these bounds, we begin by implementing a new primitive called Single-Writer Copy which takes a pointer to a word-sized memory location and atomically copies its contents into another object. The restriction is that only one process is allowed to write/copy into the destination object at a time. We believe this primitive will be very useful in designing other concurrent algorithms as well.
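The ABA problem mentioned in this abstract, and why LL/SC semantics avoid it, can be shown with a small single-threaded simulation. The LL/SC here is built from an unbounded sequence number, which is exactly the simple approach the abstract contrasts itself with, not the paper's bounded-space construction. `Cell` and its methods are hypothetical names, and nothing here is actually atomic.

```python
class Cell:
    def __init__(self, value):
        self.value = value
        self.version = 0  # unbounded sequence number, bumped on every successful write

    def cas(self, expected, new):
        # Compare-and-swap looks only at the value, so it cannot see A -> B -> A.
        if self.value != expected:
            return False
        self.value = new
        self.version += 1
        return True

    def load_link(self):
        return self.value, self.version

    def store_conditional(self, linked_version, new):
        # Succeeds only if no write at all happened since the matching load_link.
        if self.version != linked_version:
            return False
        self.value = new
        self.version += 1
        return True

# CAS: another process changes A -> B -> A between our read and our CAS.
cas_cell = Cell("A")
seen = cas_cell.value
cas_cell.cas("A", "B")
cas_cell.cas("B", "A")
cas_result = cas_cell.cas(seen, "C")                # interference goes unnoticed

# LL/SC: the same interference is detected.
llsc_cell = Cell("A")
_, link = llsc_cell.load_link()
llsc_cell.cas("A", "B")
llsc_cell.cas("B", "A")
sc_result = llsc_cell.store_conditional(link, "C")  # fails, value stays "A"

print(cas_result, sc_result)  # True False
```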