Source author record

Hristo Djidjev

Hristo Djidjev appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

cond-mat.mtrl-sci Data Structures and Algorithms Discrete Mathematics Distributed, Parallel, and Cluster Computing Emerging Technologies physics.chem-ph physics.comp-ph quant-ph

Catalog footprint

What is connected

4works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Distributed Out-of-Memory SVD on CPU/GPU Architectures

We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most only estimate the singular values as an estimation of the singular vectors which can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which is a truncated singular values and singular vectors estimation method. Memory utilization bottlenecks seen in the power method are typically associated with the computation of the Gram matrix $\mat{A}^T\mat{A}$, which can be significant when $\mat{A}$ is large and dense, or when $\mat{A}$ is super-large and sparse. The proposed implementation is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of $\mat{A}^T\mat{A}$ by using a batching strategy where the intermediate factors are computed block by block. We also suppress I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized \textit{NCCL} based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensors cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out of core SVD algorithm to successfully decompose dense matrix of size 1TB and sparse matrix of size 128PB with 1e-6 sparsity.

preprint2020arXiv

Decomposition algorithms for solving NP-hard problems on a quantum annealer

NP-hard problems such as the maximum clique or minimum vertex cover problems, two of Karp's 21 NP-hard problems, have several applications in computational chemistry, biochemistry and computer network security. Adiabatic quantum annealers can search for the optimum value of such NP-hard optimization problems, given the problem can be embedded on their hardware. However, this is often not possible due to certain limitations of the hardware connectivity structure of the annealer. This paper studies a general framework for a decomposition algorithm for NP-hard graph problems aiming to identify an optimal set of vertices. Our generic algorithm allows us to recursively divide an instance until the generated subproblems can be embedded on the quantum annealer hardware and subsequently solved. The framework is applied to the maximum clique and minimum vertex cover problems, and we propose several pruning and reduction techniques to speed up the recursive decomposition. The performance of both algorithms is assessed in a detailed simulation study.

preprint2016arXiv

Graph-based linear scaling electronic structure theory

We show how graph theory can be combined with quantum theory to calculate the electronic structure of large complex systems. The graph formalism is general and applicable to a broad range of electronic structure methods and materials, including challenging systems such as biomolecules. The methodology combines well-controlled accuracy, low computational cost, and natural low-communication parallelism. This combination addresses substantial shortcomings of linear scaling electronic structure theory, in particular with respect to quantum-based molecular dynamics simulations.

preprint2015arXiv

Shortest-Path Queries in Planar Graphs on GPU-Accelerated Architectures

We develop an efficient parallel algorithm for answering shortest-path queries in planar graphs and implement it on a multi-node CPU/GPU clusters. The algorithm uses a divide-and-conquer approach for decomposing the input graph into small and roughly equal subgraphs and constructs a distributed data structure containing shortest distances within each of those subgraphs and between their boundary vertices. For a planar graph with $n$ vertices, that data structure needs $O(n)$ storage per processor and allows queries to be answered in $O(n^{1/4})$ time.