Researcher profile

Gene Cooperman

Gene Cooperman contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2020arXiv

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process's memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.

preprint2020arXiv

Sthread: In-Vivo Model Checking of Multithreaded Programs

This work strives to make formal verification of POSIX multithreaded programs easily accessible to general programmers. Sthread operates directly on multithreaded C/C++ programs, without the need for an intermediate formal model. Sthread is in-vivo in that it provides a drop-in replacement for the pthread library, and operates directly on the compiled target executable and application libraries. There is no compiler-generated intermediate representation. The system calls in the application remain unaltered. Optionally, the programmer can add a small amount of additional native C code to include assertions based on the user's algorithm, declarations of shared memory regions, and progress/liveness conditions. The work has two important motivations: (i) It can be used to verify correctness of a concurrent algorithm being implemented with multithreading; and (ii) it can also be used pedagogically to provide immediate feedback to students learning either to employ POSIX threads system calls or to implement multithreaded algorithms. This work represents the first example of in-vivo model checking operating directly on the standard multithreaded executable and its libraries, without the aid of a compiler-generated intermediate representation. Sthread leverages the open-source SimGrid libraries, and will eventually be integrated into SimGrid. Sthread employs a non-preemptive model in which thread context switches occur only at multithreaded system calls (e.g., mutex, semaphore) or before accesses to shared memory regions. The emphasis is on finding "algorithmic bugs" (bugs in an original algorithm, implemented as POSIX threads and shared memory regions. This work is in contrast to Context-Bounded Analysis (CBA), which assumes a preemptive model for threads, and emphasizes implementation bugs such as buffer overruns and write-after-free for memory allocation. In particular, the Sthread in-vivo approach has strong future potential for pedagogy, by providing immediate feedback to students who are first learning the correct use of Pthreads system calls in implementation of concurrent algorithms based on multithreading.

preprint2018arXiv

M2: Malleable Metal as a Service

Existing bare-metal cloud services that provide users with physical nodes have a number of serious disadvantage over their virtual alternatives, including slow provisioning times, difficulty for users to release nodes and then reuse them to handle changes in demand, and poor tolerance to failures. We introduce M2, a bare-metal cloud service that uses network-mounted boot drives to overcome these disadvantages. We describe the architecture and implementation of M2 and compare its agility, scalability, and performance to existing systems. We show that M2 can reduce provisioning time by over 50% while offering richer functionality, and comparable run-time performance with respect to tools that provision images into local disks. M2 is open source and available at https://github.com/CCI-MOC/ims.

preprint2012arXiv

A Generic Checkpoint-Restart Mechanism for Virtual Machines

It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each virtual machine, the code for snapshots is laboriously developed on a per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for virtual machines. The mechanism is based on a plugin on top of an unmodified user-space checkpoint-restart package, DMTCP. Checkpoint-restart is demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU. The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest kernel driver API is augmented by 40 lines of code. DMTCP checkpoints user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need no modification. The design benefits from other DMTCP features and plugins. Experiments demonstrate checkpoint and restart in 0.2 seconds using forked checkpointing, mmap-based fast-restart, and incremental Btrfs-based snapshots.

preprint2012arXiv

FReD: Automated Debugging via Binary Search through a Process Lifetime

Reversible debuggers have been developed at least since 1970. Such a feature is useful when the cause of a bug is close in time to the bug manifestation. When the cause is far back in time, one resorts to setting appropriate breakpoints in the debugger and beginning a new debugging session. For these cases when the cause of a bug is far in time from its manifestation, bug diagnosis requires a series of debugging sessions with which to narrow down the cause of the bug. For such "difficult" bugs, this work presents an automated tool to search through the process lifetime and locate the cause. As an example, the bug could be related to a program invariant failing. A binary search through the process lifetime suffices, since the invariant expression is true at the beginning of the program execution, and false when the bug is encountered. An algorithm for such a binary search is presented within the FReD (Fast Reversible Debugger) software. It is based on the ability to checkpoint, restart and deterministically replay the multiple processes of a debugging session. It is based on GDB (a debugger), DMTCP (for checkpoint-restart), and a custom deterministic record-replay plugin for DMTCP. FReD supports complex, real-world multithreaded programs, such as MySQL and Firefox. Further, the binary search is robust. It operates on multi-threaded programs, and takes advantage of multi-core architectures during replay.

preprint2011arXiv

A Bit-Compatible Shared Memory Parallelization for ILU(k) Preconditioning and a Bit-Compatible Generalization to Distributed Memory

ILU(k) is a commonly used preconditioner for iterative linear solvers for sparse, non-symmetric systems. It is often preferred for the sake of its stability. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this important stability property. Even better, TPILU(k) preconditioning produces an answer that is bit-compatible with the sequential ILU(k) preconditioning. In terms of performance, the TPILU(k) preconditioning is shown to run faster whenever more cores are made available to it --- while continuing to be as stable as sequential ILU(k). This is in contrast to some competing methods that may become unstable if the degree of thread parallelism is raised too far. Where Block Jacobi ILU(k) fails in an application, it can be replaced by TPILU(k) in order to maintain good performance, while also achieving full stability. As a further optimization, TPILU(k) offers an optional level-based incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. Although this enhancement is not bit-compatible with classical ILU(k), it is bit-compatible with the output from the single-threaded version of the same algorithm. In experiments on a 16-core computer, the enhanced TPILU(k)-based iterative linear solver performed up to 9 times faster. As we approach an era of many-core computing, the ability to efficiently take advantage of many cores will become ever more important. TPILU(k) also demonstrates good performance on cluster or Grid. For example, the new algorithm achieves 50 times speedup with 80 nodes for general sparse matrices of dimension 160,000 that are diagonally dominant.

preprint2011arXiv

Finding the Minimal DFA of Very Large Finite State Automata with an Application to Token Passing Networks

Finite state automata (FSA) are ubiquitous in computer science. Two of the most important algorithms for FSA processing are the conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA), and then the production of the unique minimal DFA for the original NFA. We exhibit a parallel disk-based algorithm that uses a cluster of 29 commodity computers to produce an intermediate DFA with almost two billion states and then continues by producing the corresponding unique minimal DFA with less than 800,000 states. The largest previous such computation in the literature was carried out on a 512-processor CM-5 supercomputer in 1996. That computation produced an intermediate DFA with 525,000 states and an unreported number of states for the corresponding minimal DFA. The work is used to provide strong experimental evidence satisfying a conjecture on a series of token passing networks. The conjecture concerns stack sortable permutations for a finite stack and a 3-buffer. The origins of this problem lie in the work on restricted permutations begun by Knuth and Tarjan in the late 1960s. The parallel disk-based computation is also compared with both a single-threaded and multi-threaded RAM-based implementation using a 16-core 128 GB large shared memory computer.