Researcher profile

Hatem Ltaief

Hatem Ltaief contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified | Verification L1 | Unclaimed author
5 works
0 followers
4 topics
4 close collaborators

Published work

5 published items

preprint | 2020 | arXiv

Geostatistical Modeling and Prediction Using Mixed-Precision Tile Cholesky Factorization

Geostatistics represents one of the most challenging classes of scientific applications due to the desire to incorporate an ever-increasing number of geospatial locations to accurately model and predict environmental phenomena. For example, the evaluation of the Gaussian log-likelihood function, which constitutes the main computational phase, involves solving systems of linear equations with a large dense symmetric and positive definite covariance matrix. Cholesky, the standard algorithm, requires O(n^3) floating-point operations and has an O(n^2) memory footprint, where n is the number of geographical locations. Here, we present a mixed-precision tile algorithm to accelerate the Cholesky factorization during the log-likelihood function evaluation. Under an appropriate ordering, it operates with double-precision arithmetic on tiles around the diagonal, while reducing to single-precision arithmetic for tiles sufficiently far off. This translates into an improvement of the performance without any deterioration of the numerical accuracy of the application. We rely on the StarPU dynamic runtime system to schedule the tasks and to overlap them with data movement. To assess the performance and the accuracy of the proposed mixed-precision algorithm, we use synthetic and real datasets on various shared and distributed-memory systems possibly equipped with hardware accelerators. We compare our mixed-precision Cholesky factorization against the double-precision reference implementation as well as an independent block approximation method. We obtain an average of 1.6X performance speedup on massively parallel architectures while maintaining the accuracy necessary for modeling and prediction.
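The banded precision rule described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' StarPU-based implementation: the function name, the `nb` (tile size) and `band` (number of double-precision tile diagonals) parameters, and the sequential loop order are all assumptions made for illustration, and a general dense solve stands in for a dedicated triangular solve.

```python
import numpy as np

def mixed_precision_tile_cholesky(A, nb, band):
    """Right-looking tile Cholesky of a symmetric positive definite matrix A.

    Hypothetical sketch: tile (i, j) is kept in float64 when |i - j| < band
    and in float32 otherwise. Requires band >= 1 so diagonal tiles stay double.
    """
    n = A.shape[0]
    assert n % nb == 0 and band >= 1
    t = n // nb

    def prec(i, j):
        return np.float64 if abs(i - j) < band else np.float32

    # Store only the lower-triangular tiles, each in its own precision.
    T = {(i, j): A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].astype(prec(i, j))
         for i in range(t) for j in range(i + 1)}

    for k in range(t):
        # Factor the diagonal tile (always double precision).
        T[k, k] = np.linalg.cholesky(T[k, k])
        for i in range(k + 1, t):
            # Panel solve: find X with X @ Lkk.T = A_ik, in the tile's precision.
            p = prec(i, k)
            Lkk = T[k, k].astype(p)
            T[i, k] = np.linalg.solve(Lkk, T[i, k].T).T
        for i in range(k + 1, t):
            for j in range(k + 1, i + 1):
                # Trailing update, carried out in the precision of the target tile.
                p = prec(i, j)
                T[i, j] = T[i, j] - T[i, k].astype(p) @ T[j, k].astype(p).T

    # Assemble the factor in double precision for inspection.
    L = np.zeros((n, n))
    for (i, j), tile in T.items():
        L[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = tile
    return np.tril(L)
```

Setting `band` equal to the number of tile rows reproduces an all-double factorization, which gives a convenient reference for measuring how much accuracy the single-precision tiles cost on a given matrix.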

preprint | 2018 | arXiv

ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems

We present ExaGeoStat, a high performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by the parametrizable Matérn covariance function, the resulting matrix is symmetric and positive definite. The computational tasks involved during the evaluation of the Gaussian log-likelihood function become daunting as the number n of geographical locations grows, as O(n^2) storage and O(n^3) operations are required. While many approximation methods have been devised from the side of statistical modeling to ameliorate these polynomial complexities, we are interested here in the complementary approach of evaluating the exact algebraic result by exploiting advances in solution algorithms and many-core computer architectures. Using state-of-the-art high performance dense linear algebra libraries associated with various leading-edge parallel architectures (Intel KNLs, NVIDIA GPUs, and distributed-memory systems), ExaGeoStat raises the game for statistical applications from climate and environmental science. ExaGeoStat provides a reference evaluation of statistical parameters, with which to assess the validity of the various approaches based on approximation. The framework takes a first step in the merger of large-scale data analytics and extreme computing for geospatial statistical applications, to be followed by additional complexity-reducing improvements from the solver side that can be implemented under the same interface. Thus, a single uncompromised statistical model can ultimately be executed in a wide variety of emerging exascale environments.
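For orientation, the exact Gaussian log-likelihood evaluation that this abstract refers to can be sketched in a few lines of NumPy. This is not ExaGeoStat's API: the function names are hypothetical, and the exponential covariance (the Matérn family member with smoothness 1/2) is an assumption chosen so the sketch needs no Bessel functions. The Cholesky factorization is the O(n^3) step and the dense covariance matrix is the O(n^2) storage.

```python
import numpy as np

def exponential_covariance(locs, variance, length_scale):
    """Dense covariance matrix for locations of shape (n, d).

    Assumption: exponential kernel, i.e. Matern with smoothness 1/2.
    """
    dists = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return variance * np.exp(-dists / length_scale)

def gaussian_log_likelihood(z, cov):
    """Log-likelihood of zero-mean observations z under covariance cov."""
    n = z.shape[0]
    L = np.linalg.cholesky(cov)                 # O(n^3) factorization, cov = L L^T
    y = np.linalg.solve(L, z)                   # L y = z, so y.y = z^T cov^{-1} z
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|cov| from the Cholesky factor
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ y)
```

In a parameter-estimation setting this function would be called repeatedly by an optimizer over `variance` and `length_scale`, which is why the cost of each factorization dominates.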

preprint | 2014 | arXiv

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

KBLAS is a new open source high performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS is able to efficiently run on various GPU architectures across different generations, avoiding the time-consuming step of code rewriting, while still being compliant with the standard BLAS API. Another advanced optimization technique ensures coalesced memory access when dealing with submatrices, especially in the context of high level dense linear algebra algorithms. The KBLAS kernels, in all four precisions, have been extended to multi-GPU environments, which requires the introduction of new APIs to ease the user experience on these challenging systems. KBLAS outperforms existing state-of-the-art implementations on all matrix sizes, achieves asymptotically up to 50% and 60% speedup on single-GPU and multi-GPU systems, respectively, and validates our performance model. A subset of KBLAS high performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS) for larger dissemination, starting with version 6.0.
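The claim that dense matrix-vector multiplication is limited by memory accesses can be made concrete with a rough arithmetic-intensity estimate. The sketch below is not taken from KBLAS or its performance model: the function names are hypothetical, and it assumes each matrix and vector element crosses the memory bus exactly once and counts only the 2mn multiply-add flops of y = alpha*A*x + beta*y.

```python
def gemv_arithmetic_intensity(m, n, bytes_per_element=8):
    """Flops per byte for an m-by-n GEMV under a single-pass traffic assumption."""
    flops = 2 * m * n  # one multiply and one add per matrix element
    # Traffic: read A (m*n), read x (n), read y and write y (2*m).
    bytes_moved = bytes_per_element * (m * n + n + 2 * m)
    return flops / bytes_moved

def bandwidth_bound_gflops(m, n, bandwidth_gb_s, bytes_per_element=8):
    """Upper bound on GEMV throughput (Gflop/s) for a given memory bandwidth (GB/s)."""
    return gemv_arithmetic_intensity(m, n, bytes_per_element) * bandwidth_gb_s
```

Under these assumptions the intensity approaches 0.25 flop/byte in double precision and 0.5 flop/byte in single precision as the matrix grows, so the achievable rate is set by memory bandwidth rather than peak arithmetic throughput. That is the motivation for techniques that overlap data motion with computation.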

preprint | 2014 | arXiv

Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking

We study the impact of tunable parameters on computational intensity (i.e., inverse code balance) and energy consumption of multicore-optimized wavefront diamond temporal blocking (MWD) applied to different stencil-based update schemes. MWD combines the concepts of diamond tiling and multicore-aware wavefront blocking in order to achieve lower cache size requirements than standard single-core wavefront temporal blocking. We analyze the impact of the cache block size on the theoretical and observed code balance, introduce loop tiling in the leading dimension to widen the range of applicable diamond sizes, and show performance results on a contemporary Intel CPU. The impact of code balance on power dissipation on the CPU and in the DRAM is investigated and shows that DRAM power is a decisive factor for energy consumption, which is strongly influenced by the code balance. Furthermore, we show that the highest performance does not necessarily lead to the lowest energy consumption, even if the clock speed is fixed.
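The relation between code balance and computational intensity used in this abstract can be illustrated with a deliberately idealized calculation. This is not the paper's MWD traffic model: the function names and parameters are hypothetical, and the sketch assumes one read stream and one write stream per lattice update, no write-allocate traffic, and that performing several time steps in cache divides the memory traffic by exactly that number.

```python
def code_balance(bytes_per_element=8, streams_per_update=2, time_steps_in_cache=1):
    """Idealized memory traffic in bytes per lattice-site update.

    Assumption: time_steps_in_cache updates are performed per trip to memory,
    with no overhead at tile boundaries.
    """
    return bytes_per_element * streams_per_update / time_steps_in_cache

def computational_intensity(flops_per_update, balance):
    """Computational intensity (flop/byte) as the inverse of code balance."""
    return flops_per_update / balance
```

In this simplified picture a double-precision sweep without temporal blocking moves 16 bytes per update, and keeping four time steps in cache lowers that to 4 bytes per update, which quadruples the intensity for a fixed number of flops per update. Real diamond tiles add overheads that this sketch ignores.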

preprint | 2012 | arXiv

Data-Driven Execution of Fast Multipole Methods

Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load-balancing becomes a non-trivial question. A common strategy for load-balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The paper discusses another approach based on data-driven execution to efficiently tackle this challenging load-balancing problem. The core idea consists of breaking the most time-consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks, and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the QUARK runtime environment, such that data dependencies are not violated, which preserves numerical correctness. This asynchronous scheduling results in an out-of-order execution. The data-driven FMM execution outperforms the previous strategy and shows linear speedup on a quad-socket quad-core Intel Xeon system.
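The data-driven execution model described in this abstract, in which tasks form a DAG and are launched as soon as their dependencies finish, can be sketched with Python's standard thread pool. This does not use QUARK or reproduce the paper's task granularity: `run_dag` is a hypothetical helper, and the stage names in the usage note are only the conventional FMM kernel labels.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_dag(tasks, deps, max_workers=4):
    """Run callables in dependency order, out of order where the DAG allows.

    tasks: dict mapping task name -> zero-argument callable.
    deps:  dict mapping task name -> iterable of task names that must finish first.
    Returns the task names in the order they completed.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    running = {}          # future -> task name
    finished_order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every task whose dependencies are all satisfied.
            ready = [name for name, pending in remaining.items() if not pending]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            if not running:
                raise ValueError("dependency cycle or unknown dependency")
            # Block until at least one running task completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                name = running.pop(fut)
                fut.result()  # re-raise any exception from the task
                finished_order.append(name)
                for pending in remaining.values():
                    pending.discard(name)
    return finished_order
```

As a usage example, an upward and downward FMM pass could be declared with dependencies such as `{"M2M": ["P2M"], "M2L": ["M2M"], "L2L": ["M2L"], "L2P": ["L2L"]}`, leaving the near-field `"P2P"` task free to run at any time. The scheduler then interleaves it with the far-field chain instead of waiting at a stage barrier.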