Source author record

Ian Foster

Ian Foster appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

31works

22topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community-developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single-purpose LLM tools toward integrated, multi-agent workflows that combine retrieval, reasoning, tool use, and domain-specific validation. Prominent themes include retrieval-augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory-integrated closed-loop systems. Together, these results suggest that LLMs are evolving from general-purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM-enabled workflows in materials science and chemistry.

preprint2026arXiv

PRISM: Protocol Refinement through Intelligent Simulation Modeling

Automating experimental protocol design and execution remains as a fundamental bottleneck in realizing self-driving laboratories. We introduce PRISM (Protocol Refinement through Intelligent Simulation Modeling), a framework that automates the design, validation, and execution of experimental protocols on a laboratory platform composed of off-the-shelf robotic instruments. PRISM uses a set of language-model-based agents that work together to generate and refine experimental steps. The process begins with automatically gathering relevant procedures from web-based sources describing experimental workflows. These are converted into structured experimental steps (e.g., liquid handling steps, deck layout and other related operations) through a planning, critique, and validation loop. The finalized steps are translated into the Argonne MADSci protocol format, which provides a unified interface for coordinating multiple robotic instruments (Opentrons OT-2 liquid handler, PF400 arm, Azenta plate sealer and peeler) without requiring human intervention between steps. To evaluate protocol-generation performance, we benchmarked both single reasoning models and multi-agent workflow across constrained and open-ended prompting paradigms. The resulting protocols were validated in a digital-twin environment built in NVIDIA Omniverse to detect physical or sequencing errors before execution. Using Luna qPCR amplification and Cell Painting as case studies, we demonstrate PRISM as a practical end-to-end workflow that bridges language-based protocol generation, simulation-based validation, and automated robotic execution.

preprint2022arXiv

Bridging Data Center AI Systems with Edge Computing for Actionable Information Retrieval

Extremely high data rates at modern synchrotron and X-ray free-electron laser light source beamlines motivate the use of machine learning methods for data reduction, feature detection, and other purposes. Regardless of the application, the basic concept is the same: data collected in early stages of an experiment, data from past similar experiments, and/or data simulated for the upcoming experiment are used to train machine learning models that, in effect, learn specific characteristics of those data; these models are then used to process subsequent data more efficiently than would general-purpose models that lack knowledge of the specific dataset or data class. Thus, a key challenge is to be able to train models with sufficient rapidity that they can be deployed and used within useful timescales. We describe here how specialized data center AI (DCAI) systems can be used for this purpose through a geographically distributed workflow. Experiments show that although there are data movement cost and service overhead to use remote DCAI systems for DNN training, the turnaround time is still less than 1/30 of using a locally deploy-able GPU.

preprint2022arXiv

Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations

Machine learning (ML)-based steering can improve the performance of ensemble-based simulations by allowing for online selection of more scientifically meaningful computations. We present DeepDriveMD, a framework for ML-driven steering of scientific simulations that we have used to achieve orders-of-magnitude improvements in molecular dynamics (MD) performance via effective coupling of ML and HPC on large parallel computers. We discuss the design of DeepDriveMD and characterize its performance. We demonstrate that DeepDriveMD can achieve between 100-1000x acceleration for protein folding simulations relative to other methods, as measured by the amount of simulated time performed, while covering the same conformational landscape as quantified by the states sampled during a simulation. Experiments are performed on leadership-class platforms on up to 1020 nodes. The results establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios, that supports diverse MD simulation and ML back-ends, and which enables new scientific insights by improving the length and time scales accessible with current computing capacity.

preprint2022arXiv

CUF-Links: Continuous and Ubiquitous FAIRness Linkages for reproducible research

Despite much creative work on methods and tools, reproducibility -- the ability to repeat the computational steps used to obtain a research result -- remains elusive. One reason for these difficulties is that extant tools for capturing research processes do not align well with the rich working practices of scientists. We advocate here for simple mechanisms that can be integrated easily with current work practices to capture basic information about every data product consumed or produced in a project. We argue that by thus extending the scope of findable, accessible, interoperable, and reusable (FAIR) data in both time and space to enable the creation of a continuous chain of continuous and ubiquitous FAIRness linkages (CUF-Links) from inputs to outputs, such mechanisms can provide a strong foundation for documenting the provenance linkages that are essential to reproducible research. We give examples of mechanisms that can achieve these goals, and review how they have been applied in practice.

preprint2022arXiv

Extended Abstract: Productive Parallel Programming with Parsl

Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions-wrapping either Python or external applications-to indicate that these functions may be executed concurrently. Developers can then link together functions via the exchange of data. Parsl establishes a dynamic dependency graph and sends tasks for execution on connected resources when dependencies are resolved. Parsl's runtime system enables different compute resources to be used, from laptops to supercomputers, without modification to the Parsl program.

preprint2022arXiv

fairDMS: Rapid Model Training by Data and Model Reuse

Extracting actionable information rapidly from data produced by instruments such as the Linac Coherent Light Source (LCLS-II) and Advanced Photon Source Upgrade (APS-U) is becoming ever more challenging due to high (up to TB/s) data rates. Conventional physics-based information retrieval methods are hard-pressed to detect interesting events fast enough to enable timely focusing on a rare event or correction of an error. Machine learning~(ML) methods that learn cheap surrogate classifiers present a promising alternative, but can fail catastrophically when changes in instrument or sample result in degradation in ML performance. To overcome such difficulties, we present a new data storage and ML model training architecture designed to organize large volumes of data and models so that when model degradation is detected, prior models and/or data can be queried rapidly and a more suitable model retrieved and fine-tuned for new conditions. We show that our approach can achieve up to 100x data labelling speedup compared to the current state-of-the-art, 200x improvement in training speed, and 92x speedup in-terms of end-to-end model updating time.

preprint2022arXiv

Linking Scientific Instruments and HPC: Patterns, Technologies, Experiences

Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Such online analyses require methods for configuring and running high-performance distributed computing pipelines--what we call flows--linking instruments, HPC (e.g., for analysis, simulation, AI model training), edge computing (for analysis), data stores, metadata catalogs, and high-speed networks. In this article, we review common patterns associated with such flows and describe methods for instantiating those patterns. We also present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages HPC resources for data inversion, machine learning model training, or other purposes. We also discuss implications of these new methods for operators and users of scientific facilities.

preprint2022arXiv

RADICAL-Pilot and Parsl: Executing Heterogeneous Workflows on HPC Platforms

Workflows applications are becoming increasingly important to support scientific discovery. That is leading to a proliferation of workflow management systems and, thus, to a fragmented software ecosystem. Integration among existing workflow tools can improve development efficiency and, ultimately, increase the sustainability of scientific workflow software. We describe our experience with integrating RADICAL-Pilot (RP) and Parsl as a way to enable users to develop and execute workflow applications with heterogeneous tasks on heterogeneous high-performance computing resources. We describe our approach to the integration of the two systems and detail the development of RPEX, a Parsl executor which uses RP as its workload manager. We develop an RP executor that executes heterogeneous MPI Python functions on CPU cores and GPUs. We measure the weak and strong scaling of RPEX, RP, and Parsl when providing new capabilities to two paradigmatic use cases: Colmena and Ice Wedge Polygons

preprint2022arXiv

Real-Time Streaming and Event-driven Control of Scientific Experiments

Advancements in scientific instrument sensors and connected devices provide unprecedented insight into ongoing experiments and present new opportunities for control, optimization, and steering. However, the diversity of sensors and heterogeneity of their data result in make it challenging to fully realize these new opportunities. Organizing and synthesizing diverse data streams in near-real-time requires both rich automation and Machine Learning (ML). To efficiently utilize ML during an experiment, the entire ML lifecycle must be addressed, including refining experiment configurations, retraining models, and applying decisions-tasks that require an equally diverse array of computational resources spanning centralized HPC to the accelerators at the edge. Here we present the Manufacturing Data and Machine Learning platform (MDML). The MDML is designed to standardize the research and operational environment for advanced data analytics and ML-enabled automated process optimization by providing the cyberinfrastructure to integrate sensor data streams and AI in cyber-physical systems for in-situ analysis. To achieve this, the MDML provides a fabric to receive and aggregate IoT data and simultaneously orchestrate remote computation across the computing continuum. In this paper we describe the MDML and show how it is used in advanced manufacturing to act on IoT data and orchestrate distributed ML to guide experiments.

preprint2022arXiv

Sharing Begins at Home

The broad sharing of research data is widely viewed as of critical importance for the speed, quality, accessibility, and integrity of science. Despite increasing efforts to encourage data sharing, both the quality of shared data, and the frequency of data reuse, remain stubbornly low. We argue here that a major reason for this unfortunate state of affairs is that the organization of research results in the findable, accessible, interoperable, and reusable (FAIR) form required for reuse is too often deferred to the end of a research project, when preparing publications, by which time essential details are no longer accessible. Thus, we propose an approach to research informatics that applies FAIR principles continuously, from the very inception of a research project, and ubiquitously, to every data asset produced by experiment or computation. We suggest that this seemingly challenging task can be made feasible by the adoption of simple tools, such as lightweight identifiers (to ensure that every data asset is findable), packaging methods (to facilitate understanding of data contents), data access methods, and metadata organization and structuring tools (to support schema development and evolution). We use an example from experimental neuroscience to illustrate how these methods can work in practice.

preprint2022arXiv

The History of the Grid

With the widespread availability of high-speed networks, it becomes feasible to outsource computing to remote providers and to federate resources from many locations. Such observations motivated the development, from the mid-1990s onwards, of a range of innovative Grid technologies, applications, and infrastructures. We review the history, current status, and future prospects for Grid computing.

preprint2022arXiv

Ultrafast Focus Detection for Automated Microscopy

Technological advancements in modern scientific instruments, such as scanning electron microscopes (SEMs), have significantly increased data acquisition rates and image resolutions enabling new questions to be explored; however, the resulting data volumes and velocities, combined with automated experiments, are quickly overwhelming scientists as there remain crucial steps that require human intervention, for example reviewing image focus. We present a fast out-of-focus detection algorithm for electron microscopy images collected serially and demonstrate that it can be used to provide near-real-time quality control for neuroscience workflows. Our technique, \textit{Multi-scale Histologic Feature Detection}, adapts classical computer vision techniques and is based on detecting various fine-grained histologic features. We exploit the inherent parallelism in the technique to employ GPU primitives in order to accelerate characterization. We show that our method can detect of out-of-focus conditions within just 20ms. To make these capabilities generally available, we deploy our feature detector as an on-demand service and show that it can be used to determine the degree of focus in approximately 230ms, enabling near-real-time use.

preprint2021arXiv

Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs

Rapid growth in scientific data and a widening gap between computational speed and I/O bandwidth make it increasingly infeasible to store and share all data produced by scientific simulations. Instead, we need methods for reducing data volumes: ideally, methods that can scale data volumes adaptively so as to enable negotiation of performance and fidelity tradeoffs in different situations. Multigrid-based hierarchical data representations hold promise as a solution to this problem, allowing for flexible conversion between different fidelities so that, for example, data can be created at high fidelity and then transferred or stored at lower fidelity via logically simple and mathematically sound operations. However, the effective use of such representations has been hindered until now by the relatively high costs of creating, accessing, reducing, and otherwise operating on such representations. We describe here highly optimized data refactoring kernels for GPU accelerators that enable efficient creation and manipulation of data in multigrid-based hierarchical forms. We demonstrate that our optimized design can achieve up to 250 TB/s aggregated data refactoring throughput -- 83% of theoretical peak -- on 1024 nodes of the Summit supercomputer. We showcase our optimized design by applying it to a large-scale scientific visualization workflow and the MGARD lossy compression software.

preprint2021arXiv

AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.

preprint2021arXiv

Fast and accurate learned multiresolution dynamical downscaling for precipitation

This study develops a neural network-based approach for emulating high-resolution modeled precipitation data with comparable statistical properties but at greatly reduced computational cost. The key idea is to use combination of low- and high- resolution simulations to train a neural network to map from the former to the latter. Specifically, we define two types of CNNs, one that stacks variables directly and one that encodes each variable before stacking, and we train each CNN type both with a conventional loss function, such as mean square error (MSE), and with a conditional generative adversarial network (CGAN), for a total of four CNN variants. We compare the four new CNN-derived high-resolution precipitation results with precipitation generated from original high resolution simulations, a bilinear interpolater and the state-of-the-art CNN-based super-resolution (SR) technique. Results show that the SR technique produces results similar to those of the bilinear interpolator with smoother spatial and temporal distributions and smaller data variabilities and extremes than the original high resolution simulations. While the new CNNs trained by MSE generate better results over some regions than the interpolator and SR technique do, their predictions are still not as close as the original high resolution simulations. The CNNs trained by CGAN generate more realistic and physically reasonable results, better capturing not only data variability in time and space but also extremes such as intense and long-lasting storms. The new proposed CNN-based downscaling approach can downscale precipitation from 50~km to 12~km in 14~min for 30~years once the network is trained (training takes 4~hours using 1~GPU), while the conventional dynamical downscaling would take 1~month using 600 CPU cores to generate simulations at the resolution of 12~km over contiguous United States.

preprint2020arXiv

Big Data Staging with MPI-IO for Interactive X-ray Science

New techniques in X-ray scattering science experiments produce large data sets that can require millions of high-performance processing hours per week of computation for analysis. In such applications, data is typically moved from X-ray detectors to a large parallel file system shared by all nodes of a petascale supercomputer and then is read repeatedly as different science application tasks proceed. However, this straightforward implementation causes significant contention in the file system. We propose an alternative approach in which data is instead staged into and cached in compute node memory for extended periods, during which time various processing tasks may efficiently access it. We describe here such a big data staging framework, based on MPI-IO and the Swift parallel scripting language. We discuss a range of large-scale data management issues involved in X-ray scattering science and measure the performance benefits of the new staging framework for high-energy diffraction microscopy, an important emerging application in data-intensive X-ray scattering. We show that our framework accelerates scientific processing turnaround from three months to under 10 minutes, and that our I/O technique reduces input overheads by a factor of 5 on 8K Blue Gene/Q nodes.

preprint2020arXiv

funcX: A Federated Function Serving Fabric for Science

Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than more than 130000 concurrent workers.

preprint2020arXiv

In-situ Workflow Auto-tuning via Combining Performance Models of Component Applications

In-situ parallel workflows couple multiple component applications, such as simulation and analysis, via streaming data transfer. in order to avoid data exchange via shared file systems. Such workflows are challenging to configure for optimal performance due to the large space of possible configurations. Expert experience is rarely sufficient to identify optimal configurations, and existing empirical auto-tuning approaches are inefficient due to the high cost of obtaining training data for machine learning models. It is also infeasible to optimize individual components independently, due to component interactions. We propose here a new auto-tuning method, Component-based Ensemble Active Learning (CEAL), that combines machine learning techniques with knowledge of in-situ workflow structure to enable automated workflow configuration with a limited number of performance measurements.

preprint2020arXiv

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to computer diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.

preprint2020arXiv

The Data Station: Combining Data, Compute, and Market Forces

This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone. Data Stations do not deliver data to users; instead, users bring questions to data. This inversion of the usual relationship between data and compute mitigates many of the security risks that are otherwise associated with sharing and working with sensitive data. Data Stations are designed following the principle that many data problems require human involvement, and that incentives are the key to obtaining such involvement. To that end, Data Stations implement market designs to create, manage, and coordinate the use of incentives. We explain the motivation for this new kind of platform and its design.

preprint2020arXiv

The Manufacturing Data and Machine Learning Platform: Enabling Real-time Monitoring and Control of Scientific Experiments via IoT

IoT devices and sensor networks present new opportunities for measuring, monitoring, and guiding scientific experiments. Sensors, cameras, and instruments can be combined to provide previously unachievable insights into the state of ongoing experiments. However, IoT devices can vary greatly in the type, volume, and velocity of data they generate, making it challenging to fully realize this potential. Indeed, synergizing diverse IoT data streams in near-real time can require the use of machine learning (ML). In addition, new tools and technologies are required to facilitate the collection, aggregation, and manipulation of sensor data in order to simplify the application of ML models and in turn, fully realize the utility of IoT devices in laboratories. Here we will demonstrate how the use of the Argonne-developed Manufacturing Data and Machine Learning (MDML) platform can analyze and use IoT devices in a manufacturing experiment. MDML is designed to standardize the research and operational environment for advanced data analytics and AI-enabled automated process optimization by providing the infrastructure to integrate AI in cyber-physical systems for in situ analysis. We will show that MDML is capable of processing diverse IoT data streams, using multiple computing resources, and integrating ML models to guide an experiment.

preprint2020arXiv

Translating the Grid: How a Translational Approach Shaped the Development of Grid Computing

A growing gap between progress in biological knowledge and improved health outcomes inspired the new discipline of translational medicine, in which the application of new knowledge is an explicit part of a research plan. Abramson and Parashar argue that a similar gap between complex computational technologies and ever-more-challenging applications demands an analogous discipline of translational computer science, in which the deliberate movement of research results into large-scale practice becomes a central research focus rather than an afterthought. We revisit from this perspective the development and application of grid computing from the mid-1990s onwards, and find that a translational framing is useful for understanding the technology's development and impact. We discuss how the development of grid computing infrastructure, and the Globus Toolkit, in particular, benefited from a translational approach. We identify lessons learned that can be applied to other translational computer science initiatives.

preprint2019arXiv

A Data Ecosystem to Support Machine Learning in Materials Science

Facilitating the application of machine learning to materials science problems will require enhancing the data ecosystem to enable discovery and collection of data from many sources, automated dissemination of new data across the ecosystem, and the connecting of data with materials-specific machine learning models. Here, we present two projects, the Materials Data Facility (MDF) and the Data and Learning Hub for Science (DLHub), that address these needs. We use examples to show how MDF and DLHub capabilities can be leveraged to link data with machine learning models and how users can access those capabilities through web and programmatic interfaces.

preprint2019arXiv

Machine Learning Prediction of Accurate Atomization Energies of Organic Molecules from Low-Fidelity Quantum Chemical Calculations

Recent studies illustrate how machine learning (ML) can be used to bypass a core challenge of molecular modeling: the tradeoff between accuracy and computational cost. Here, we assess multiple ML approaches for predicting the atomization energy of organic molecules. Our resulting models learn the difference between low-fidelity, B3LYP, and high-accuracy, G4MP2, atomization energies, and predict the G4MP2 atomization energy to 0.005 eV (mean absolute error) for molecules with less than 9 heavy atoms and 0.012 eV for a small set of molecules with between 10 and 14 heavy atoms. Our two best models, which have different accuracy/speed tradeoffs, enable the efficient prediction of G4MP2-level energies for large molecules and are available through a simple web interface.

preprint2019arXiv

TomoGAN: Low-Dose Synchrotron X-Ray Tomography with Generative Adversarial Networks

Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions from tens of micrometers to a few nanometers. In order to resolve sample features at smaller length scales, however, a higher radiation dose is required. Therefore, the limitation on the achievable resolution is set primarily by noise at these length scales. We present \TOMOGAN{}, a denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions. We evaluate our approach in two photon-budget-limited experimental conditions: (1) sufficient number of low-dose projections (based on Nyquist sampling), and (2) insufficient or limited number of high-dose projections. In both cases the angular sampling is assumed to be isotropic, and the photon budget throughout the experiment is fixed based on the maximum allowable radiation dose on the sample. Evaluation with both simulated and experimental datasets shows that our approach can significantly reduce noise in reconstructed images, improving the structural similarity score of simulation and experimental data from 0.18 to 0.9 and from 0.18 to 0.41, respectively. Furthermore, the quality of the reconstructed images with filtered back projection followed by our denoising approach exceeds that of reconstructions with the simultaneous iterative reconstruction technique, showing the computational superiority of our approach.

preprint2008arXiv

Accelerating Large-scale Data Exploration through Data Diffusion

Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web-caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both micro-benchmarks and a large scale astronomy application demonstrate that our approach improves performance relative to alternative approaches, as well as provides improved scalability as aggregated I/O bandwidth scales linearly with the number of data cache nodes.

preprint2008arXiv

Cloud Computing and Grid Computing 360-Degree Compared

Cloud Computing has become another buzzword after Web 2.0. However, there are dozens of different definitions for Cloud Computing and there seems to be no consensus on what a Cloud is. On the other hand, Cloud Computing is not a completely new concept; it has intricate connection to the relatively new but thirteen-year established Grid Computing paradigm, and other relevant technologies such as utility computing, cluster computing, and distributed systems in general. This paper strives to compare and contrast Cloud Computing with Grid Computing from various angles and give insights into the essential characteristics of both.

preprint2008arXiv

Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming

Loosely coupled programming is a powerful paradigm for rapidly creating higher-level applications from scientific programs on petascale systems, typically using scripting languages. This paradigm is a form of many-task computing (MTC) which focuses on the passing of data between programs as ordinary files rather than messages. While it has the significant benefits of decoupling producer and consumer and allowing existing application programs to be executed in parallel with no recoding, its typical implementation using shared file systems places a high performance burden on the overall system and on the user who will analyze and consume the downstream data. Previous efforts have achieved great speedups with loosely coupled programs, but have done so with careful manual tuning of all shared file system access. In this work, we evaluate a prototype collective IO model for file-based MTC. The model enables efficient and easy distribution of input data files to computing nodes and gathering of output results from them. It eliminates the need for such manual tuning and makes the programming of large-scale clusters using a loosely coupled model easier. Our approach, inspired by in-memory approaches to collective operations for parallel programming, builds on fast local file systems to provide high-speed local file caches for parallel scripts, uses a broadcast approach to handle distribution of common input data, and uses efficient scatter/gather and caching techniques for input and output. We describe the design of the prototype model, its implementation on the Blue Gene/P supercomputer, and present preliminary measurements of its performance on synthetic benchmarks and on a large-scale molecular dynamics application.

preprint2008arXiv

Towards Loosely-Coupled Programming on Petascale Systems

We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new-and potentially far larger-class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor-cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second.

preprint2004arXiv

DiPerF: an automated DIstributed PERformance testing Framework

We present DiPerF, a distributed performance testing framework, aimed at simplifying and automating service performance evaluation. DiPerF coordinates a pool of machines that test a target service, collects and aggregates performance metrics, and generates performance statistics. The aggregate data collected provide information on service throughput, on service "fairness" when serving multiple clients concurrently, and on the impact of network latency on service performance. Furthermore, using this data, it is possible to build predictive models that estimate a service performance given the service load. We have tested DiPerF on 100+ machines on two testbeds, Grid3 and PlanetLab, and explored the performance of job submission services (pre WS GRAM and WS GRAM) included with Globus Toolkit 3.2.

Ian Foster

What is connected

Connect this record

See the researcher in context

Building this map preview

31 published item(s)

From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

PRISM: Protocol Refinement through Intelligent Simulation Modeling

Bridging Data Center AI Systems with Edge Computing for Actionable Information Retrieval

Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations

CUF-Links: Continuous and Ubiquitous FAIRness Linkages for reproducible research

Extended Abstract: Productive Parallel Programming with Parsl

fairDMS: Rapid Model Training by Data and Model Reuse

Linking Scientific Instruments and HPC: Patterns, Technologies, Experiences

RADICAL-Pilot and Parsl: Executing Heterogeneous Workflows on HPC Platforms

Real-Time Streaming and Event-driven Control of Scientific Experiments

Sharing Begins at Home

The History of the Grid

Ultrafast Focus Detection for Automated Microscopy

Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs

AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Fast and accurate learned multiresolution dynamical downscaling for precipitation

Big Data Staging with MPI-IO for Interactive X-ray Science

funcX: A Federated Function Serving Fabric for Science

In-situ Workflow Auto-tuning via Combining Performance Models of Component Applications

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

The Data Station: Combining Data, Compute, and Market Forces

The Manufacturing Data and Machine Learning Platform: Enabling Real-time Monitoring and Control of Scientific Experiments via IoT

Translating the Grid: How a Translational Approach Shaped the Development of Grid Computing

A Data Ecosystem to Support Machine Learning in Materials Science

Machine Learning Prediction of Accurate Atomization Energies of Organic Molecules from Low-Fidelity Quantum Chemical Calculations

TomoGAN: Low-Dose Synchrotron X-Ray Tomography with Generative Adversarial Networks

Accelerating Large-scale Data Exploration through Data Diffusion

Cloud Computing and Grid Computing 360-Degree Compared

Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming

Towards Loosely-Coupled Programming on Petascale Systems

DiPerF: an automated DIstributed PERformance testing Framework