Researcher profile

Jiajun Cao

Jiajun Cao contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 11 - Unverified
Verification L1
Unclaimed author
1 work
0 followers
2 topics
3 close collaborators

Published work

1 published item

Preprint | 2014 | arXiv

Transparent Checkpoint-Restart over InfiniBand

InfiniBand is widely used for low-latency, high-throughput cluster computing. Saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. Because of a lack of a solution, typical MPI implementations have included custom checkpoint-restart services that "tear down" the network, checkpoint each node as if the node were a standalone computer, and then re-connect the network again. We present the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach is independent of any particular Linux kernel, thus simplifying the current practice of using a kernel-based module, such as BLCR. This direct approach results in checkpoints that are found to be faster than with the use of a checkpoint-restart service. The generality of this approach is shown not only by checkpointing an MPI computation, but also a native UPC computation (Berkeley Unified Parallel C), which does not use MPI. Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). In addition, a cost-effective debugging approach is also enabled, in which a checkpoint image from an InfiniBand-based production cluster is copied to a local Ethernet-based cluster, where it can be restarted and an interactive debugger can be attached to it. This work is based on a plugin that extends the DMTCP (Distributed MultiThreaded CheckPointing) checkpoint-restart package.