Source author record

Alexander Spiegelman

Alexander Spiegelman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Cryptography and Security Performance

Catalog footprint

What is connected

12works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing

Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, and every execution of the block must yield the same deterministic outcome. Block-STM further enforces that the outcome is consistent with executing transactions according to a preset order, leveraging this order to dynamically detect dependencies and avoid conflicts during speculative transaction execution. At the core of Block-STM is a novel, low-overhead collaborative scheduler of execution and validation tasks. Block-STM is implemented on the main branch of the Diem Blockchain code-base and runs in production at Aptos. Our evaluation demonstrates that Block-STM is adaptive to workloads with different conflict rates and utilizes the inherent parallelism therein. Block-STM achieves up to $110k$ tps in the Diem benchmarks and up to $170k$ tps in the Aptos Benchmarks, which is a $20$x and $17$x improvement over the sequential baseline with $32$ threads, respectively. The throughput on a contended workload is up to $50k$ tps and $80k$ tps in Diem and Aptos benchmarks, respectively.

preprint2022arXiv

Bullshark: DAG BFT Protocols Made Practical

We present Bullshark, the first directed acyclic graph (DAG) based asynchronous Byzantine Atomic Broadcast protocol that is optimized for the common synchronous case. Like previous DAG-based BFT protocols, Bullshark requires no extra communication to achieve consensus on top of building the DAG. That is, parties can totally order the vertices of the DAG by interpreting their local view of the DAG edges. Unlike other asynchronous DAG-based protocols, Bullshark provides a practical low latency fast-path that exploits synchronous periods and deprecates the need for notoriously complex view-change mechanisms. Bullshark achieves this while maintaining all the desired properties of its predecessor DAG-Rider. Namely, it has optimal amortized communication complexity, it provides fairness and asynchronous liveness, and safety is guaranteed even under a quantum adversary. In order to show the practicality and simplicity of our approach, we also introduce a standalone partially synchronous version of Bullshark which we evaluate against the state of the art. The implemented protocol is embarrassingly simple (200 LOC on top of an existing DAG-based mempool implementation (Narwhal & Tusk). It is highly efficient, achieving for example, 125,000 transaction per second with a 2 seconds latency for a deployment of 50 parties. In the same setting the state of the art pays a steep 50% latency increase as it optimizes for asynchrony.

preprint2022arXiv

Bullshark: The Partially Synchronous Version

The purpose of this manuscript is to describe the deterministic partially synchronous version of Bullshark in a simple and clean way. This result is published in CCS 2022, however, the description there is less clear because it uses the terminology of the full asynchronous Bullshark. The CCS version ties the description of the asynchronous and partially synchronous versions of Bullshark since it targets an academic audience. Due to the recent interest in DAG-based BFT protocols, we provide a separate and simple description of the partially synchronous version that targets a more general audience. We focus here on the DAG ordering logic. For more details about the asynchronous version, garbage collection, fairness, proofs, related work, evaluation, and efficient DAG implementation please refer to the fullpaper. An intuitive extended summary can be found in the "DAG meets BFT" blogpost.

preprint2022arXiv

Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus

We propose separating the task of reliable transaction dissemination from transaction ordering, to enable high-performance Byzantine fault-tolerant quorum-based consensus. We design and evaluate a mempool protocol, Narwhal, specializing in high-throughput reliable dissemination and storage of causal histories of transactions. Narwhal tolerates an asynchronous network and maintains high performance despite failures. Narwhal is designed to easily scale-out using multiple workers at each validator, and we demonstrate that there is no foreseeable limit to the throughput we can achieve. Composing Narwhal with a partially synchronous consensus protocol (Narwhal-HotStuff) yields significantly better throughput even in the presence of faults or intermittent loss of liveness due to asynchrony. However, loss of liveness can result in higher latency. To achieve overall good performance when faults occur we design Tusk, a zero-message overhead asynchronous consensus protocol, to work with Narwhal. We demonstrate its high performance under a variety of configurations and faults. As a summary of results, on a WAN, Narwhal-Hotstuff achieves over 130,000 tx/sec at less than 2-sec latency compared with 1,800 tx/sec at 1-sec latency for Hotstuff. Additional workers increase throughput linearly to 600,000 tx/sec without any latency increase. Tusk achieves 160,000 tx/sec with about 3 seconds latency. Under faults, both protocols maintain high throughput, but Narwhal-HotStuff suffers from increased latency.

preprint2021arXiv

Be Prepared When Network Goes Bad: An Asynchronous View-Change Protocol

The popularity of permissioned blockchain systems demands BFT SMR protocols that are efficient under good network conditions (synchrony) and robust under bad network conditions (asynchrony). The state-of-the-art partially synchronous BFT SMR protocols provide optimal linear communication cost per decision under synchrony and good leaders, but lose liveness under asynchrony. On the other hand, the state-of-the-art asynchronous BFT SMR protocols are live even under asynchrony, but always pay quadratic cost even under synchrony. In this paper, we propose a BFT SMR protocol that achieves the best of both worlds -- optimal linear cost per decision under good networks and leaders, optimal quadratic cost per decision under bad networks, and remains always live.

preprint2021arXiv

Using Nesting to Push the Limits of Transactional Data Structure Libraries

Transactional data structure libraries (TDSL) combine the ease-of-programming of transactions with the high performance and scalability of custom-tailored concurrent data structures. They can be very efficient thanks to their ability to exploit data structure semantics in order to reduce overhead, aborts, and wasted work compared to general-purpose software transactional memory. However, TDSLs were not previously used for complex use-cases involving long transactions and a variety of data structures. In this paper, we boost the performance and usability of a TDSL, towards allowing it to support complex applications. A key idea is nesting. Nested transactions create checkpoints within a longer transaction, so as to limit the scope of abort, without changing the semantics of the original transaction. We build a Java TDSL with built-in support for nested transactions over a number of data structures. We conduct a case study of a complex network intrusion detection system that invests a significant amount of work to process each packet. Our study shows that our library outperforms publicly available STMs twofold without nesting, and by up to 16x when nesting is used.

preprint2020arXiv

Cogsworth: Byzantine View Synchronization

Most methods for Byzantine fault tolerance (BFT) in the partial synchrony setting divide the local state of the nodes into views, and the transition from one view to the next dictates a leader change. In order to provide liveness, all honest nodes need to stay in the same view for a sufficiently long time. This requires \emph{view synchronization}, a requisite of BFT that we extract and formally define here. Existing approaches for Byzantine view synchronization incur quadratic communication (in $n$, the number of parties). A cascade of $O(n)$ view changes may thus result in $O(n^3)$ communication complexity. This paper presents a new Byzantine view synchronization algorithm named Cogsworth, that has optimistically linear communication complexity and constant latency. Faced with benign failures, Cogsworth has expected linear communication and constant latency. The result here serves as an important step towards reaching solutions that have overall quadratic communication, the known lower bound on Byzantine fault tolerant consensus. Cogsworth is particularly useful for a family of BFT protocols that already exhibit linear communication under various circumstances, but suffer quadratic overhead due to view synchronization.

preprint2020arXiv

Not a COINcidence: Sub-Quadratic Asynchronous Byzantine Agreement WHP

King and Saia were the first to break the quadratic word complexity bound for Byzantine Agreement in synchronous systems against an adaptive adversary, and Algorand broke this bound with near-optimal resilience (first in the synchronous model and then with eventual-synchrony). Yet the question of asynchronous sub-quadratic Byzantine Agreement remained open. To the best of our knowledge, we are the first to answer this question in the affirmative. A key component of our solution is a shared coin algorithm based on a VRF. A second essential ingredient is VRF-based committee sampling, which we formalize and utilize in the asynchronous model for the first time. Our algorithms work against a delayed-adaptive adversary, which cannot perform after-the-fact removals but has full control of Byzantine processes and full information about communication in earlier rounds. Using committee sampling and our shared coin, we solve Byzantine Agreement with high probability, with a word complexity of $\widetilde{O}(n)$ and $O(1)$ expected time, breaking the $O(n^2)$ bit barrier for asynchronous Byzantine Agreement.

preprint2016arXiv

Flexible Paxos: Quorum intersection revisited

Distributed consensus is integral to modern distributed systems. The widely adopted Paxos algorithm uses two phases, each requiring majority agreement, to reliably reach consensus. In this paper, we demonstrate that Paxos, which lies at the foundation of many production systems, is conservative. Specifically, we observe that each of the phases of Paxos may use non-intersecting quorums. Majority quorums are not necessary as intersection is required only across phases. Using this weakening of the requirements made in the original formulation, we propose Flexible Paxos, which generalizes over the Paxos algorithm to provide flexible quorums. We show that Flexible Paxos is safe, efficient and easy to utilize in existing distributed systems. We conclude by discussing the wide reaching implications of this result. Examples include improved availability from reducing the size of second phase quorums by one when the number of acceptors is even and utilizing small disjoint phase-2 quorums to speed up the steady-state.

preprint2015arXiv

On Liveness of Dynamic Storage

Dynamic distributed storage algorithms such as DynaStore, Reconfigurable Paxos, RAMBO, and RDS, do not ensure liveness (wait-freedom) in asynchronous runs with infinitely many reconfigurations. We prove that this is inherent for asynchronous dynamic storage algorithms, including ones that use $Ω$ or $\diamond S$ oracles. Our result holds even if only one process may fail, provided that machines that were successfully removed from the system's configuration may be switched off by an administrator. Intuitively, the impossibility relies on the fact that a correct process can be suspected to have failed at any time, i.e., its failure is indistinguishable to other processes from slow delivery of its messages, and so the system should be able to reconfigure without waiting for this process to complete its pending operations. To circumvent this result, we define a dynamic eventually perfect failure detector, and present an algorithm that uses it to emulate wait-free dynamic atomic storage (with no restrictions on reconfiguration rate). Together, our results thus draw a sharp line between oracles like $Ω$ and $\diamond S$, which allow some correct process to continue to be suspected forever, and a dynamic eventually perfect one, which does not.

preprint2015arXiv

Space Bounds for Reliable Multi-Writer Data Store: Inherent Cost of Read/Write Primitives

Reliable storage emulations from fault-prone components have established themselves as an algorithmic foundation of modern storage services and applications. Most existing reliable storage emulations are built from storage services supporting arbitrary read-modify-write primitives. Since such primitives are not typically exposed by pre-existing or off-the-shelf components (such as cloud storage services or network-attached disks) it is natural to ask if they are indeed essential for efficient storage emulations. In this paper, we answer this question in the affirmative. We show that relaxing the underlying storage to only support read/write operations leads to a linear blow-up in the emulation space requirements. We also show that the space complexity is not adaptive to concurrency, which implies that the storage cannot be reliably reclaimed even in sequential runs. On a positive side, we show that Compare-and-Swap primitives, which are commonly available with many off-the-shelf storage services, can be used to emulate a reliable multi-writer atomic register with constant storage and adaptive time complexity.

preprint2015arXiv

Space Bounds for Reliable Storage: Fundamental Limits of Coding

We study the inherent space requirements of shared storage algorithms in asynchronous fault-prone systems. Previous works use codes to achieve a better storage cost than the well-known replication approach. However, a closer look reveals that they incur extra costs somewhere else: Some use unbounded storage in communication links, while others assume bounded concurrency or synchronous periods. We prove here that this is inherent, and indeed, if there is no bound on the concurrency level, then the storage cost of any reliable storage algorithm is at least f+1 times the data size, where f is the number of tolerated failures. We further present a technique for combining erasure-codes with full replication so as to obtain the best of both. We present a storage algorithm whose storage cost is close to the lower bound in the worst case, and adapts to the concurrency level.

Alexander Spiegelman

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing

Bullshark: DAG BFT Protocols Made Practical

Bullshark: The Partially Synchronous Version

Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus

Be Prepared When Network Goes Bad: An Asynchronous View-Change Protocol

Using Nesting to Push the Limits of Transactional Data Structure Libraries

Cogsworth: Byzantine View Synchronization

Not a COINcidence: Sub-Quadratic Asynchronous Byzantine Agreement WHP

Flexible Paxos: Quorum intersection revisited

On Liveness of Dynamic Storage

Space Bounds for Reliable Multi-Writer Data Store: Inherent Cost of Read/Write Primitives

Space Bounds for Reliable Storage: Fundamental Limits of Coding