Researcher profile

Alex Kogan

Alex Kogan contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

Hapax Locks : Value-Based Mutual Exclusion

We present Hapax Locks, a novel locking algorithm that is simple, enjoys constant-time arrival and unlock paths, provides FIFO admission order, and which is also space efficient and generates relatively little coherence traffic under contention in the common case. Hapax Locks offer performance (both latency and scalability) that is comparable with the best state of the art locks, while at the same time Hapax Locks impose fewer constraints and dependencies on the ambient runtime environment, making them particularly easy to integrate or retrofit into existing systems or under existing application programming interfaces Of particular note, no pointers shift or escape ownership between threads in our algorithm.

preprint2022arXiv

Hemlock : Compact and Scalable Mutual Exclusion

We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free -- not requiring any information to be passed from a lock operation to the corresponding unlock -- and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.

preprint2021arXiv

Optimizing Inference Performance of Transformers on CPUs

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous research attention is paid to the training of those models, relatively little efforts are made to improve their inference performance. This paper comes to address this gap by presenting an empirical analysis of scalability and performance of inferencing a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace, and are shown to achieve the speedup of up to x2.37. The considered optimizations do not require any changes to the implementation of the models nor affect their accuracy.

preprint2020arXiv

Efficient Multi-word Compare and Swap

Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.

preprint2020arXiv

Fissile Locks

Classic test-and-test (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any admission order guarantees. Such concerns led to the development of scalable queue-based locks, such as a recent Compact NUMA-aware (CNA) lock, a variant of another popular queue-based MCS lock. CNA scales well under load and provides certain admission guarantees, but has more complicated lock handover operations than TS and incurs higher latencies at low contention. We propose Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock, which serves as a fast path, and a CNA lock, which serves as a slow path. The key feature of Fissile locks is the ability of threads on the fast path to bypass threads enqueued on the slow path, and acquire the lock with less overhead than CNA. Bypass is bounded (by a tunable parameter) to avoid starvation and ensure long-term fairness. The result is a highly scalable NUMA-aware lock with progress guarantees that performs like TS at low contention and like CNA at high contention.

preprint2020arXiv

Scalable Range Locks for Scalable Address Spaces and Beyond

Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest in the Linux kernel community seeking to alleviate bottlenecks in the virtual memory management subsystem. The existing implementation of range locks in the kernel, however, uses an internal spin lock to protect the underlying tree structure that keeps track of acquired and requested ranges. This spin lock becomes a point of contention on its own when the range lock is frequently acquired. Furthermore, where and exactly how specific (refined) ranges can be locked remains an open question. In this paper, we make two independent, but related contributions. First, we propose an alternative approach for building range locks based on linked lists. The lists are easy to maintain in a lock-less fashion, and in fact, our range locks do not use any internal locks in the common case. Second, we show how the range of the lock can be refined in the mprotect operation through a speculative mechanism. This refinement, in turn, allows concurrent execution of mprotect operations on non-overlapping memory regions. We implement our new algorithms and demonstrate their effectiveness in user-space and kernel-space, achieving up to 9$\times$ speedup compared to the stock version of the Linux kernel. Beyond the virtual memory management subsystem, we discuss other applications of range locks in parallel software. As a concrete example, we show how range locks can be used to facilitate the design of scalable concurrent data structures, such as skip lists.