Researcher profile

Franck Cappello

Franck Cappello contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
15works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

15 published item(s)

preprint2026arXiv

FFCz: Fast Fourier Correction for Spectrum-Preserving Lossy Compression of Scientific Data

This paper introduces a novel technique to preserve spectral features in lossy compression based on a novel fast Fourier correction algorithm\added{ for regular-grid data}. Preserving both spatial and frequency representations of data is crucial for applications such as cosmology, turbulent combustion, and X-ray diffraction, where spatial and frequency views provide complementary scientific insights. In particular, many analysis tasks rely on frequency-domain representations to capture key features, including the power spectrum of cosmology simulations, the turbulent energy spectrum in combustion, and diffraction patterns in reciprocal space for ptychography. However, existing compression methods guarantee accuracy only in the spatial domain while disregarding the frequency domain. To address this limitation, we propose an algorithm that corrects the errors produced by off-the-shelf ``base'' compressors such as SZ3, ZFP, and SPERR, thereby preserving both spatial and frequency representations by bounding errors in both domains. By expressing frequency-domain errors as linear combinations of spatial-domain errors, we derive a region that jointly bounds errors in both domains. Given as input the spatial errors from a base compressor and user-defined error bounds in the spatial and frequency domains, we iteratively project the spatial error vector onto the regions defined by the spatial and frequency constraints until it lies within their intersection. We further accelerate the algorithm using GPU parallelism to achieve practical performance. We validate our approach with datasets from cosmology simulations, X-ray diffraction, combustion simulation, and electroencephalography demonstrating its effectiveness in preserving critical scientific information in both spatial and frequency domains.

preprint2026arXiv

pMSz: A Distributed Parallel Algorithm for Correcting Extrema and Morse Smale Segmentations in Lossy Compression

Lossy compression, widely used by scientists to reduce data from simulations, experiments, and observations, can distort features of interest even under bounded error. Such distortions may compromise downstream analyses and lead to incorrect scientific conclusions in applications such as combustion and cosmology. This paper presents a distributed and parallel algorithm for correcting topological features, specifically, piecewise linear Morse Smale segmentations (PLMSS), which decompose the domain into monotone regions labeled by their corresponding local minima and maxima. While a single GPU algorithm (MSz) exists for PLMSS correction after compression, no methodology has been developed that scales beyond a single GPU for extreme scale data. We identify the key bottleneck in scaling PLMSS correction as the parallel computation of integral paths, a communication-intensive computation that is notoriously difficult to scale. Instead of explicitly computing and correcting integral paths, our algorithm simplifies MSz by preserving steepest ascending and descending directions across all locations, thereby minimizing interprocess communication while introducing negligible additional storage overhead. With this simplified algorithm and relaxed synchronization, our method achieves over 90% parallel efficiency on 128 GPUs on the Perlmutter supercomputer for real world datasets.

preprint2026arXiv

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks for up to 512 GPUs, ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as the training prolongs.

preprint2022arXiv

Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5

Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of deep understanding on compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve the parallel-write performance. Specifically, we propose analytical models to predict the time of compression and parallel write before the actual compression to enable compression-write overlapping. We also introduce an extra space in the process to handle possible data overflows resulting from prediction uncertainty in compression ratios. Moreover, we propose an optimization to reorder the compression tasks to increase the overlapping efficiency. Experiments with up to 4,096 cores from Summit show that our solution improves the write performance by up to 4.5X and 2.9X over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (compared to original data) on two real-world HPC applications.

preprint2022arXiv

Improving Prediction-Based Lossy Compression Dramatically via Ratio-Quality Modeling

Error-bounded lossy compression is one of the most effective techniques for scientific data reduction. However, the traditional trial-and-error approach used to configure lossy compressors for finding the optimal trade-off between reconstructed data quality and compression ratio is prohibitively expensive. To resolve this issue, we develop a general-purpose analytical ratio-quality model based on the prediction-based lossy compression framework, which can effectively foresee the reduced data quality and compression ratio, as well as the impact of the lossy compressed data on post-hoc analysis quality. Our analytical model significantly improves the prediction-based lossy compression in three use-cases: (1) optimization of predictor by selecting the best-fit predictor; (2) memory compression with a target ratio; and (3) in-situ compression optimization by fine-grained error-bound tuning of various data partitions. We evaluate our analytical model on 10 scientific datasets, demonstrating its high accuracy (93.47% accuracy on average) and low computational cost (up to 18.7X lower than the trial-and-error approach) for estimating the compression ratio and the impact of lossy compression on post-hoc analysis quality. We also verified the high efficiency of our ratio-quality model using different applications across the three use-cases. In addition, the experiment demonstrates that our modeling based approach reduces the time to store the 3D Reverse Time Migration data by up to 3.4X over the traditional solution using 128 CPU cores from 8 compute nodes.

preprint2022arXiv

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly while its decompression still suffers considerably lower performance because of its sophisticated lossless compression step -- a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose a deep architectural optimization for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory on decoding/writing phases, online tuning the amount of shared memory to use, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.

preprint2022arXiv

ROIBIN-SZ: Fast and Science-Preserving Compression for Serial Crystallography

Crystallography is the leading technique to study atomic structures of proteins and produces enormous volumes of information that can place strains on the storage and data transfer capabilities of synchrotron and free-electron laser light sources. Lossy compression has been identified as a possible means to cope with the growing data volumes; however, prior approaches have not produced sufficient quality at a sufficient rate to meet scientific needs. This paper presents Region Of Interest BINning with SZ lossy compression (ROIBIN-SZ) a novel, parallel, and accelerated compression scheme that separates the dynamically selected preservation of key regions with lossy compression of background information. We perform and present an extensive evaluation of the performance and quality results made by the co-design of this compression scheme. We can achieve up to a 196x and 46.44x compression ratio on lysozyme and selenobiotinyl-streptavidin while preserving the data sufficiently to reconstruct the structure at bandwidths and scales that approach the needs of the upcoming light sources

preprint2022arXiv

SIMD Lossy Compression for Scientific Data

Modern HPC applications produce increasingly large amounts of data, which limits the performance of current extreme-scale systems. Data reduction techniques, such as lossy compression, help to mitigate this issue by decreasing the size of data generated by these applications. SZ, a current state-of-the-art lossy compressor, is able to achieve high compression ratios, but the prediction/quantization methods used introduce dependencies which prevent parallelizing this step of the compression. Recent work proposes a parallel dual prediction/quantization algorithm for GPUs which removes these dependencies. However, some HPC systems and applications do not use GPUs, and could still benefit from the fine-grained parallelism of this method. Using the dual-quantization technique, we implement and optimize a SIMD vectorized CPU version of SZ, and create a heuristic for selecting the optimal block size and vector length. We also investigate the effect of non-zero block padding values to decrease the number of unpredictable values along compression block borders. We measure performance of vecSZ against an O3 optimized CPU version of SZ using dual-quantization, pSZ, as well as SZ-1.4. We evaluate our vectorized version, vecSZ, on the Intel Skylake and AMD Rome architectures using real-world scientific datasets. We find that applying alternative padding reduces the number of outliers by 100\% for some configurations. Our implementation also results in up to 32\% improvement in rate-distortion and up to 15$\times$ speedup over SZ-1.4, achieving a prediction and quantization bandwidth in excess of 3.4 GB/s.

preprint2022arXiv

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today's scientific high performance computing (HPC) applications or advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in scientific community, because not only can it significantly reduce the data volumes but it can also strictly control the data distortion based on the use-specified error bound. Existing lossy compressors, however, cannot offer ultra-fast compression speed, which is highly demanded by quite a few applications or use-cases (such as in-memory compression and online instrument data compression). In this paper, we propose a novel ultra-fast error-bounded lossy compressor, which can obtain fairly high compression performance on both CPU and GPU, also with reasonably high compression ratios. The key contributions are three-fold: (1) We propose a novel, generic ultra-fast error-bounded lossy compression framework -- called UFZ, by confining our design to be composed of only super-lightweight operations such as bitwise and addition/subtraction operation, still keeping a certain high compression ratio. (2) We implement UFZ on both CPU and GPU and optimize the performance according to their architectures carefully. (3) We perform a comprehensive evaluation with 6 real-world production-level scientific datasets on both CPU and GPU. Experiments show that UFZ is 2~16X as fast as the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPU and GPU, with respect to both compression and decompression.

preprint2021arXiv

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multi-threaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0X and 6.8X on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3X over the multi-thread encoder on two 28-core Xeon Platinum 8280 CPUs.

preprint2021arXiv

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale

Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that lead to poor scalability and performance. As modern HPC infrastructures continue to evolve, there is a growing gap between compute capacity vs. I/O capabilities. Furthermore, the storage hierarchy is becoming increasingly heterogeneous: in addition to parallel file systems, it comprises burst buffers, key-value stores, deep memory hierarchies at node level, etc. In this context, state of art is insufficient to deal with the diversity of vendor APIs, performance and persistency characteristics. This extended abstract presents an overview of VeloC (Very Low Overhead Checkpointing System), a checkpointing runtime specifically design to address these challenges for the next generation Exascale HPC applications and systems. VeloC offers a simple API at user level, while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous storage.

preprint2020arXiv

cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data

Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors-SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x, respectively, over the production version running on single and multiple CPU cores, respectively, while getting the same quality of reconstructed data. It also improves the compression ratio by up to 3.48x on the tested data compared with another state-of-the-art GPU supported lossy compressor.

preprint2020arXiv

FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data

With ever-increasing volumes of scientific floating-point data being produced by high-performance computing applications, significantly reducing scientific floating-point data size is critical, and error-controlled lossy compressors have been developed for years. None of the existing scientific floating-point lossy data compressors, however, support effective fixed-ratio lossy compression. Yet fixed-ratio lossy compression for scientific floating-point data not only compresses to the requested ratio but also respects a user-specified error bound with higher fidelity. In this paper, we present FRaZ: a generic fixed-ratio lossy compression framework respecting user-specified error constraints. The contribution is twofold. (1) We develop an efficient iterative approach to accurately determine the appropriate error settings for different lossy compressors based on target compression ratios. (2) We perform a thorough performance and accuracy evaluation for our proposed fixed-ratio compression framework with multiple state-of-the-art error-controlled lossy compressors, using several real-world scientific floating-point datasets from different domains. Experiments show that FRaZ effectively identifies the optimum error setting in the entire error setting space of any given lossy compressor. While fixed-ratio lossy compression is slower than fixed-error compression, it provides an important new lossy compression technique for users of very large scientific floating-point datasets.

preprint2020arXiv

FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly.Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).

preprint2020arXiv

Full-State Quantum Circuit Simulation by Using Data Compression

Quantum circuit simulations are critical for evaluating quantum algorithms and machines. However, the number of state amplitudes required for full simulation increases exponentially with the number of qubits. In this study, we leverage data compression to reduce memory requirements, trading computation time and fidelity for memory space. Specifically, we develop a hybrid solution by combining the lossless compression and our tailored lossy compression method with adaptive error bounds at each timestep of the simulation. Our approach optimizes for compression speed and makes sure that errors due to lossy compression are uncorrelated, an important property for comparing simulation output with physical machines. Experiments show that our approach reduces the memory requirement of simulating the 61-qubit Grover's search algorithm from 32 exabytes to 768 terabytes of memory on Argonne's Theta supercomputer using 4,096 nodes. The results suggest that our techniques can increase the simulation size by 2 to 16 qubits for general quantum circuits.