Source author record

Rio Yokota

Rio Yokota appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Numerical Analysis; Distributed, Parallel, and Cluster Computing; math.NA; Mathematical Software; physics.comp-ph; Machine Learning; Artificial Intelligence; Computation; Computational Engineering, Finance, and Science; Computer Vision; physics.chem-ph; physics.flu-dyn; Robotics

Catalog footprint

What is connected

20 works

13 topics

4 close collaborators

preprint, 2022, arXiv

OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching

Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts an off-policy data distribution instead of an on-policy one, enabling a significant reduction in the number of interactions with the environment, (2) learns a stationary reward function that is transferable with high generalization capabilities on changing dynamics, and (3) leverages mode-covering behavior for faster convergence. We demonstrate through experiments that our method is considerably more sample efficient and generalizes to novel environments. Our method achieves better or comparable results on policy performance baselines with significantly fewer interactions. Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior art is prone to fail.

preprint, 2022, arXiv

Parallel QR Factorization of Block Low-Rank Matrices

We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR, and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of $\mathcal{O}(mn)$, while the tiled BLR-QR exhibits $\mathcal{O}(mn^{1.5})$ complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram-Schmidt (MGS) BLR-QR using fork-join parallelism, and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k $\times$ 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill-conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.
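As background for the MGS baseline mentioned in this abstract, here is a minimal dense sketch of modified Gram-Schmidt QR in pure Python. It is not the BLR algorithm from the paper and ignores low-rank structure entirely. The names `mgs_qr` and `orthogonality_error` and the 3 x 2 example matrix are illustrative assumptions, not taken from the source.

```python
import math


def mgs_qr(a):
    """Modified Gram-Schmidt QR of a full-column-rank matrix.

    a: list of rows (m x n, m >= n).
    Returns (q, r) where q is a list of n orthonormal columns (each of
    length m) and r is an n x n upper-triangular matrix as a list of rows.
    """
    m, n = len(a), len(a[0])
    # Copy the columns of a; they are orthogonalized in place.
    v = [[a[i][j] for i in range(m)] for j in range(n)]
    q = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q.append(qj)
        # MGS step: immediately remove the q_j component from every
        # remaining column, using the already-updated columns.
        for k in range(j + 1, n):
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    return q, r


def orthogonality_error(q):
    """Largest absolute entry of Q^T Q - I for a list of columns q."""
    n = len(q)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(x * y for x, y in zip(q[i], q[j]))
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(dot - target))
    return worst


# Illustrative, well-conditioned example matrix (an assumption).
A = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
Q, R = mgs_qr(A)
```

The quantity returned by `orthogonality_error` is one common way to quantify how orthogonal a computed factor is, which is the property the abstract reports as better for the Householder-based methods on ill-conditioned inputs.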

preprint, 2022, arXiv

Scalable Linear Time Dense Direct Solver for 3-D Problems Without Trailing Sub-Matrix Dependencies

Factorization of large dense matrices is ubiquitous in engineering and data science applications, e.g., preconditioners for iterative boundary integral solvers, frontal matrices in sparse multifrontal solvers, and computing the determinant of covariance matrices. HSS and $\mathcal{H}^2$-matrices are hierarchical low-rank matrix formats that can reduce the complexity of factorizing such dense matrices from $\mathcal{O}(N^3)$ to $\mathcal{O}(N)$. For HSS matrices, it is possible to remove the dependency on the trailing matrices during Cholesky/LU factorization, which results in a highly parallel algorithm. However, the weak admissibility of HSS causes the rank of off-diagonal blocks to grow for 3-D problems, and the method is no longer $\mathcal{O}(N)$. On the other hand, the strong admissibility of $\mathcal{H}^2$-matrices allows them to handle 3-D problems in $\mathcal{O}(N)$, but introduces a dependency on the trailing matrices. In the present work, we pre-compute the fill-ins and integrate them into the shared basis, which allows us to remove the dependency on trailing matrices even for $\mathcal{H}^2$-matrices. Comparisons with a block low-rank factorization code LORAPO showed a maximum speedup of 4,700x for a 3-D problem with complex geometry.

preprint, 2020, arXiv

Epipolar-Guided Deep Object Matching for Scene Change Detection

This paper describes a viewpoint-robust object-based change detection network (OBJ-CDNet). Mobile cameras such as drive recorders capture images from different viewpoints each time due to differences in camera trajectory and shutter timing. However, previous methods for pixel-wise change detection are vulnerable to viewpoint differences because they assume aligned image pairs as inputs. To cope with this difficulty, we introduce a deep graph matching network that establishes object correspondence between an image pair. This enables us to detect object-wise scene changes without precise image alignment. For more accurate object matching, we propose an epipolar-guided deep graph matching network (EGMNet), which incorporates the epipolar constraint into the deep graph matching layer used in OBJ-CDNet. To evaluate our network's robustness against viewpoint differences, we created synthetic and real datasets for scene change detection from an image pair. The experimental results verified the effectiveness of our network.

preprint, 2020, arXiv

Scalable and Practical Natural Gradient for Large-Scale Deep Learning

Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.

preprint, 2016, arXiv

A Matrix-free Preconditioner for the Helmholtz Equation based on the Fast Multipole Method

Fast multipole methods (FMM) were originally developed for accelerating $N$-body problems for particle-based methods. FMM is more than an $N$-body solver, however. Recent efforts to view the FMM as an elliptic Partial Differential Equation (PDE) solver have opened the possibility to use it as a preconditioner for a broader range of applications. FMM can solve Helmholtz problems with optimal $\mathcal{O}(N \log N)$ complexity, has compute-bound inner kernels, and highly asynchronous communication patterns. The combination of these features makes FMM an interesting candidate as a preconditioner for sparse solvers on architectures of the future. The use of FMM as a preconditioner allows us to use lower order multipole expansions than would be required as a solver because individual solves need not be accurate. This reduces the amount of computation and communication significantly and makes the time-to-solution competitive with state-of-the-art preconditioners. Furthermore, the high asynchronicity of FMM allows it to scale to much larger core counts than factorization-based and multilevel methods. We describe our tests in reproducible detail with freely available codes.

preprint, 2016, arXiv

Fast Multipole Method as a Matrix-Free Hierarchical Low-Rank Approximation

There has been a large increase in the amount of work on hierarchical low-rank approximation methods, where the interest is shared by multiple communities that previously did not intersect. The objective of this article is twofold: to provide a thorough review of the recent advancements in this field from both analytical and algebraic perspectives, and to present a comparative benchmark of two highly optimized implementations of contrasting methods for some simple yet representative test cases. We categorize the recent advances in this field from the perspective of compute-memory tradeoff, which has not been considered in much detail in this area. Benchmark tests reveal that there is a large difference in the memory consumption and performance between the different methods.

preprint, 2016, arXiv

Fast Multipole Preconditioners for Sparse Matrices Arising from Elliptic Equations

Among optimal hierarchical algorithms for the computational solution of elliptic problems, the Fast Multipole Method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxable global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Here, we do not discuss the well developed applications of FMM to implement matrix-vector multiplications within Krylov solvers of boundary element methods. Instead, we propose using FMM for the volume-to-volume contribution of inhomogeneous Poisson-like problems, where the boundary integral is a small part of the overall computation. Our method may be used to precondition sparse matrices arising from finite difference/element discretizations, and can handle a broader range of scientific applications. Compared with multigrid methods, it is capable of comparable algebraic convergence rates down to the truncation error of the discretized PDE, and it offers potentially superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the low rank character of off-diagonal blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may reduce the amount of communication because it is matrix-free and exploits the tree structure of FMM. We describe our tests in reproducible detail with freely available codes and outline directions for further extensibility.

preprint, 2016, arXiv

Multi-Level Restricted Maximum Likelihood Covariance Estimation and Kriging for Large Non-Gridded Spatial Datasets

We develop a multi-level restricted Gaussian maximum likelihood method for estimating the covariance function parameters and computing the best unbiased predictor. Our approach produces a new set of multi-level contrasts where the deterministic parameters of the model are filtered out, thus enabling the estimation of the covariance parameters to be decoupled from the deterministic component. Moreover, the multi-level covariance matrix of the contrasts exhibits fast decay that is dependent on the smoothness of the covariance function. Due to the fast decay of the multi-level covariance matrix coefficients, only a small set is computed with a level-dependent criterion. We demonstrate our approach on problems of up to 512,000 observations with a Matérn covariance function and highly irregular placements of the observations. In addition, these problems are numerically unstable and hard to solve with traditional methods.

preprint, 2014, arXiv

A Performance Model for the Communication in Fast Multipole Methods on HPC Platforms

Exascale systems are predicted to have approximately one billion cores, assuming Gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the current parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. There is therefore an urgent need to model application performance and to understand what changes need to be made to ensure extrapolated scalability. The fast multipole method (FMM) was originally developed for accelerating N-body problems in astrophysics and molecular dynamics, but has recently been extended to a wider range of problems, including preconditioners for sparse linear solvers. Its high arithmetic intensity, combined with its linear complexity and asynchronous communication patterns, makes it a promising algorithm for exascale systems. In this paper, we discuss the challenges for FMM on current parallel computers and future exascale architectures, with a focus on inter-node communication. We develop a performance model that considers the communication patterns of the FMM, and observe a good match between our model and the actual communication time, when latency, bandwidth, network topology, and multi-core penalties are all taken into account. To our knowledge, this is the first formal characterization of inter-node communication in FMM, which validates the model against actual measurements of communication time.

preprint, 2014, arXiv

Asynchronous Execution of the Fast Multipole Method Using Charm++

Fast multipole methods (FMM) on distributed memory have traditionally used a bulk-synchronous model of communicating the local essential tree (LET) and overlapping it with computation of the local data. This could be perceived as an extreme case of data aggregation, where the whole LET is communicated at once. Charm++ allows a much finer control over the granularity of communication, and has an asynchronous execution model that fits well with the structure of our FMM code. Unlike previous work on asynchronous fast N-body methods such as ChaNGa and PEPC, the present work performs a direct comparison between the traditional bulk-synchronous approach and the asynchronous approach using Charm++. Furthermore, the serial performance of our FMM code is over an order of magnitude better than these previous codes, so it is much more challenging to hide the overhead of Charm++.

preprint, 2014, arXiv

Communication Complexity of the Fast Multipole Method and its Algebraic Variants

A combination of hierarchical tree-like data structures and data access patterns from fast multipole methods and hierarchical low-rank approximation of linear operators from H-matrix methods appears to form an algorithmic path forward for efficient implementation of many linear algebraic operations of scientific computing at the exascale. The combination provides asymptotically optimal computational and communication complexity and applicability to large classes of operators that commonly arise in scientific computing applications. A convergence of the mathematical theories of the fast multipole and H-matrix methods has been underway for over a decade. We recap this mathematical unification and describe implementation aspects of a hybrid of these two compelling hierarchical algorithms on hierarchical distributed-shared memory architectures, which are likely to be the first to reach the exascale. We present a new communication complexity estimate for fast multipole methods on such architectures. We also show how the data structures and access patterns of H-matrices for low-rank operators map onto those of fast multipole, leading to an algebraically generalized form of fast multipole that compromises none of its architecturally ideal properties.

preprint, 2012, arXiv

An FMM Based on Dual Tree Traversal for Many-core Architectures

The present work attempts to integrate the independent efforts in the fast N-body community to create the fastest N-body library for many-core and heterogeneous architectures. Focus is placed on low accuracy optimizations, in response to the recent interest in using FMM as a preconditioner for sparse linear solvers. A direct comparison with other state-of-the-art fast N-body codes demonstrates that orders of magnitude increase in performance can be achieved by careful selection of the optimal algorithm and low-level optimization of the code. The current N-body solver uses a fast multipole method with an efficient strategy for finding the list of cell-cell interactions by a dual tree traversal. A task-based threading model is used to maximize thread-level parallelism and intra-node load-balancing. In order to extract the full potential of the SIMD units on the latest CPUs, the inner kernels are optimized using AVX instructions. Our code -- exaFMM -- is an order of magnitude faster than the current state-of-the-art FMM codes, which are themselves an order of magnitude faster than the average FMM code.

preprint, 2012, arXiv

Data-Driven Execution of Fast Multipole Methods

Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load-balancing becomes a non-trivial question. A common strategy for load-balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The authors discuss in the paper another approach based on data-driven execution to efficiently tackle this challenging load-balancing problem. The core idea consists of breaking the most time-consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks, and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the QUARK runtime environment, in such a way that data dependencies are not violated, which preserves numerical correctness. This asynchronous scheduling results in an out-of-order execution. The data-driven FMM execution outperforms the previous strategy and shows linear speedup on a quad-socket quad-core Intel Xeon system.
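The DAG view described in this abstract can be illustrated with a small dependency-respecting scheduler. This is only a sketch using Kahn's topological ordering in standard-library Python; it does not use or imitate the QUARK API, and it runs tasks serially rather than asynchronously. The function name `schedule`, the stage names (P2M, M2M, M2L, L2L, L2P, P2P) and the coarse one-task-per-stage dependency graph are simplifying assumptions; a real task graph would have many tasks per stage.

```python
from collections import deque


def schedule(tasks, deps):
    """Return one execution order that respects all dependencies.

    tasks: iterable of task names.
    deps: dict mapping a task to the set of tasks it must wait for.
    Raises ValueError if the dependencies contain a cycle.
    """
    tasks = list(tasks)
    # Number of unfinished prerequisites for each task.
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    # Reverse edges: which tasks are waiting on a given task.
    dependents = {t: [] for t in tasks}
    for task, prerequisites in deps.items():
        for p in prerequisites:
            dependents[p].append(task)

    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a runtime would execute the task here
        for d in dependents[task]:
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical, heavily simplified FMM stage graph (an assumption).
FMM_TASKS = ["P2M", "M2M", "M2L", "L2L", "L2P", "P2P"]
FMM_DEPS = {
    "M2M": {"P2M"},
    "M2L": {"M2M"},
    "L2L": {"M2L"},
    "L2P": {"L2L"},
}
ORDER = schedule(FMM_TASKS, FMM_DEPS)
```

In this sketch the near-field task `P2P` has no prerequisites, so it becomes ready immediately and can be interleaved with the far-field chain, which is the kind of out-of-order execution the abstract refers to.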

preprint, 2012, arXiv

FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method

The Lagrangian vortex method offers an alternative numerical approach for direct numerical simulation of turbulence. The fact that it uses the fast multipole method (FMM)--a hierarchical algorithm for N-body problems with highly scalable parallel implementations--as its numerical engine makes it a potentially good candidate for exascale systems. However, there have been few validation studies of Lagrangian vortex simulations, and the insufficient comparisons against standard DNS codes have left ample room for skepticism. This paper presents a comparison between a Lagrangian vortex method and a pseudo-spectral method for the simulation of decaying homogeneous isotropic turbulence. This flow field is chosen despite the fact that it is not the most favorable flow problem for particle methods (which shine in wake flows or where vorticity is compact), because it is ideal for the quantitative validation of DNS codes. We use a 256^3 grid with Re_lambda=50 and 100 and look at the turbulence statistics, including high-order moments. The focus is on the effect of the various parameters in the vortex method, e.g., order of FMM series expansion, frequency of reinitialization, overlap ratio and time step. The vortex method uses an FMM code (exaFMM) that runs on GPU hardware using CUDA, while the spectral code (hit3d) runs on CPU only. Results indicate that, for this application (and with the current code implementations), the spectral method is an order of magnitude faster than the vortex method when using a single GPU for the FMM and six CPU cores for the FFT.

preprint, 2011, arXiv

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
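The scaling figures quoted in this abstract can be related with a few lines of arithmetic. The sketch below uses the conventional definitions of strong and weak scaling efficiency, which the abstract does not spell out, so treating them as the paper's definitions is an assumption. The function names and all timings passed to them are hypothetical; only the particle and process counts come from the text.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Fixed total problem size: achieved speedup over the baseline run
    divided by the factor by which the process count grew."""
    return (t_base / t_p) / (p / p_base)


def weak_scaling_efficiency(t_base, t_p):
    """Fixed problem size per process: baseline time over time at scale."""
    return t_base / t_p


# Figures taken from the abstract.
PARTICLES_PER_PROCESS = 10**6
PROCESSES = 32_768

# Total problem size of the largest weak scaling run.
TOTAL_UNKNOWNS = PARTICLES_PER_PROCESS * PROCESSES

# Hypothetical timings, chosen only to exercise the formulas.
EXAMPLE_STRONG = strong_scaling_efficiency(100.0, 1, 12.5, 16)
EXAMPLE_WEAK = weak_scaling_efficiency(28.8, 40.0)
```

`TOTAL_UNKNOWNS` evaluates to 32,768,000,000, consistent with the "more than 32 billion unknowns" stated for the largest run.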

preprint, 2011, arXiv

Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns

We present teraflop-scale calculations of biomolecular electrostatics enabled by the combination of algorithmic and hardware acceleration. The algorithmic acceleration is achieved with the fast multipole method (FMM) in conjunction with a boundary element method (BEM) formulation of the continuum electrostatic model, as well as the BIBEE approximation to BEM. The hardware acceleration is achieved through graphics processing units (GPUs). We demonstrate the power of our algorithms and software for the calculation of the electrostatic interactions between biological molecules in solution. The applications demonstrated include the electrostatics of protein--drug binding and several multi-million atom systems consisting of hundreds to thousands of copies of lysozyme molecules. The parallel scalability of the software was studied in a cluster at the Nagasaki Advanced Computing Center, using 128 nodes, each with 4 GPUs. Delicate tuning has resulted in strong scaling with parallel efficiency of 0.8 for 256 and 0.5 for 512 GPUs. The largest application run, with over 20 million atoms and one billion unknowns, required only one minute on 512 GPUs. We are currently adapting our BEM software to solve the linearized Poisson-Boltzmann equation for dilute ionic solutions, and it is also designed to be flexible enough to be extended for a variety of integral equation problems, ranging from Poisson problems to Helmholtz problems in electromagnetics and acoustics to high Reynolds number flow.

preprint, 2011, arXiv

Hierarchical N-body simulations with auto-tuning for heterogeneous systems

With the current hybridization of treecodes and FMMs, combined with auto-tuning capabilities on heterogeneous architectures, the flexibility of fast N-body methods has been greatly enhanced. These features are a requirement for developing a black-box software library for fast N-body algorithms on heterogeneous systems, which is our immediate goal.

preprint, 2010, arXiv

Treecode and fast multipole method for N-body simulation with CUDA

Due to the variety and importance of applications of treecodes and FMM, the combination of algorithmic acceleration with hardware acceleration can have tremendous impact. Alas, programming these algorithms efficiently is no piece of cake. In this contribution, we aim to present GPU kernels for treecode and FMM in, as much as possible, an uncomplicated, accessible way. The interested reader should consult some of the copious literature on the subject for a deeper understanding of the algorithms themselves. Here, we will offer the briefest of summaries. We will focus our attention on achieving a GPU implementation that is efficient in its utilization of the architecture, but without applying the most advanced techniques known in the field (which would complicate the presentation).

preprint, 2009, arXiv

PetRBF--A parallel O(N) algorithm for radial basis function interpolation

We have developed a parallel algorithm for radial basis function (RBF) interpolation that exhibits O(N) complexity, requires O(N) storage, and scales excellently up to a thousand processes. The algorithm uses a GMRES iterative solver with a restricted additive Schwarz method (RASM) as a preconditioner and a fast matrix-vector algorithm. Previous fast RBF methods, achieving at most O(N log N) complexity, were developed using multiquadric and polyharmonic basis functions. In contrast, the present method uses Gaussians with a small variance (a common choice in particle methods for fluid simulation, our main target application). The fast decay of the Gaussian basis function allows rapid convergence of the iterative solver even when the subdomains in the RASM are very small. The present method was implemented in parallel using the PETSc library (developer version). Numerical experiments demonstrate its capability in problems of RBF interpolation with more than 50 million data points, timing at 106 seconds (19 iterations) for an error tolerance of 10^-15 on 1024 processors of a Blue Gene/L (700 MHz PowerPC processors). The parallel code is freely available in the open-source model.
