Researcher profile

Mohamed Wahib

Mohamed Wahib contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2026arXiv

A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism

Neural architecture search (NAS) has emerged as a powerful paradigm that enables researchers to automatically explore vast search spaces and discover efficient neural networks. However, NAS suffers from a critical bottleneck, i.e. the evaluation of numerous architectures during the search process demands substantial computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the evaluation time of neural architectures. However, they are not efficient enough and still only focus on the accuracy of architectures. Beyond classification accuracy, real-world applications increasingly demand more efficient and compact network architectures that balance multiple performance criteria. To address these challenges, we propose the SMEMNAS, a pairwise comparison relation-assisted multi-objective evolutionary algorithm based on a multi-population mechanism. In the SMEMNAS, a surrogate model is constructed based on pairwise comparison relations to predict the accuracy ranking of architectures, rather than the absolute accuracy. Moreover, two populations cooperate with each other in the search process, i.e. a main population that guides the evolutionary process, while a vice population that enhances search diversity. Our method aims to discover high-performance models that simultaneously optimize multiple objectives. We conduct comprehensive experiments on CIFAR-10, CIFAR-100 and ImageNet datasets to validate the effectiveness of our approach. With only a single GPU searching for 0.17 days, competitive architectures can be found by SMEMNAS which achieves 78.91% accuracy with the MAdds of 570M on the ImageNet. This work makes a significant advancement in the field of NAS.

preprint2024arXiv

CG-Kit: Code Generation Toolkit for Performant and Maintainable Variants of Source Code Applied to Flash-X Hydrodynamics Simulations

CG-Kit is a new code generation toolkit that we propose as a solution for portability and maintainability for scientific computing applications. The development of CG-Kit is rooted in the urgent need created by the shifting landscape of high-performance computing platforms and the algorithmic complexities of a particular large-scale multiphysics application: Flash-X. This combination leads to unique challenges including handling an existing large code base in Fortran and/or C/C++, subdivision of code into a great variety of units supporting a wide range of physics and numerical methods, different parallelization techniques for distributed- and shared-memory systems and accelerator devices, and heterogeneity of computing platforms requiring coexisting variants of parallel algorithms. The challenges demand that developers determine custom abstractions and granularity for code generation. CG-Kit tackles this with standalone tools that can be combined into highly specific and, we argue, highly effective portability and maintainability tool chains. Here we present the design of our new tools: parametrized source trees, control flow graphs, and recipes. The tools are implemented in Python. Although the tools are agnostic to the programming language of the source code, we focus on C/C++ and Fortran. Code generation experiments demonstrate the generation of variants of parallel algorithms: first, multithreaded variants of the basic AXPY operation (scalar-vector addition and vector-vector multiplication) to introduce the application of CG-Kit tool chains; and second, variants of parallel algorithms within a hydrodynamics solver, called Spark, from Flash-X that operates on block-structured adaptive meshes. In summary, code generated by CG-Kit achieves a reduction by over 60% of the original C/C++/Fortran source code.

preprint2022arXiv

Flash-X, a multiphysics simulation software instrument

Flash-X is a highly composable multiphysics software system that can be used to simulate physical phenomena in several scientific domains. It derives some of its solvers from FLASH, which was first released in 2000. Flash-X has a new framework that relies on abstractions and asynchronous communications for performance portability across a range of increasingly heterogeneous hardware platforms. Flash-X is meant primarily for solving Eulerian formulations of applications with compressible and/or incompressible reactive flows. It also has a built-in, versatile Lagrangian framework that can be used in many different ways, including implementing tracers, particle-in-cell simulations, and immersed boundary methods.

preprint2022arXiv

Preparing for the Future -- Rethinking Proxy Apps

A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into a an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely-heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating the past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.

preprint2021arXiv

GTOPX Space Mission Benchmarks

This contribution introduces the GTOPX space mission benchmark collection, which is an extension of GTOP database published by the European Space Agency (ESA). GTOPX consists of ten individual benchmark instances representing real-world interplanetary space trajectory design problems. In regard to the original GTOP collection, GTOPX includes three new problem instances featuring mixed-integer and multi-objective properties. GTOPX enables a simplified user handling, unified benchmark function call and some minor bug corrections to the original GTOP implementation. Furthermore, GTOPX is linked from it's original C++ source code to Python and Matlab based on dynamic link libraries, assuring computationally fast and accurate reproduction of the benchmark results in all three programming languages. Space mission trajectory design problems as those represented in GTOPX are known to be highly non-linear and difficult to solve. The GTOPX collection, therefore, aims particularly at researchers wishing to put advanced (meta)heuristic and hybrid optimization algorithms to the test. The goal of this paper is to provide researchers with a manual and reference to the newly available GTOPX benchmark software.

preprint2021arXiv

Matrix Engines for High Performance Computing:A Paragon of Performance or Grasping at Straws?

Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too. Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.

preprint2020arXiv

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there is no full understanding of the characteristics of those synchronization methods. This work explores important undocumented features and provides an in-depth analysis of the performance considerations and pitfalls of the state-of-art synchronization methods for Nvidia GPUs. The provided analysis would be useful when making design choices for applications, libraries, and frameworks running on single and/or multi-GPU environments. We provide a case study of the commonly used reduction operator to illustrate how the knowledge gained in our analysis can be useful. We also describe our micro-benchmarks and measurement methods.

preprint2020arXiv

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.

preprint2020arXiv

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.