Source author record

Ryousei Takano

Ryousei Takano appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning Networking and Internet Architecture Performance

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module

In this prompt report, we present the basic performance evaluation of Intel Optane Data Center Persistent Memory Module (Optane DCPMM), which is the first commercially-available, byte-addressable non-volatile memory modules released in April 2019. Since at the moment of writing only a few reports on its performance were published, this letter is intended to complement other performance studies. Through experiments using our own measurement tools, we obtained that the latency of random read-only access was approximately 374 ns. That of random writeback-involving access was 391 ns. The bandwidths of read-only and writeback-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s, respectively.

preprint2020arXiv

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.

preprint2019arXiv

A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies

Non-volatile memory (NVM) is a promising technology for low-energy and high-capacity main memory of computers. The characteristics of NVM devices, however, tend to be fundamentally different from those of DRAM (i.e., the memory device currently used for main memory), because of differences in principles of memory cells. Typically, the write latency of an NVM device such as PCM and ReRAM is much higher than its read latency. The asymmetry in read/write latencies likely affects the performance of applications significantly. For analyzing behavior of applications running on NVM-based main memory, most researchers use software-based emulation tools due to the limited number of commercial NVM products. However, these existing emulation tools are too slow to emulate a large-scale, realistic workload or too simplistic to investigate the details of application behavior on NVM with asymmetric read/write latencies. This paper therefore proposes a new NVM emulation mechanism that is not only light-weight but also aware of a read/write latency gap in NVM-based main memory. We implemented the prototype of the proposed mechanism for the Intel CPU processors of the Haswell architecture. We also evaluated its accuracy and performed case studies for practical benchmarks. The results showed that our prototype accurately emulated write-latencies of NVM-based main memory: it emulated the NVM write latencies in a range from 200 ns to 1000 ns with negligible errors from 0.2% to 1.1%. We confirmed that the use of our emulator enabled us to successfully estimate performance of practical workloads for NVM-based main memory, while an existing light-weight emulation model misestimated.

preprint2015arXiv

Feasibility Evaluation of 6LoWPAN over Bluetooth Low Energy

IPv6 over Low power Wireless Personal Area Network (6LoWPAN) is an emerging technology to enable ubiquitous IoT services. However, there are very few studies of the performance evaluation on real hardware environments. This paper demonstrates the feasibility of 6LoWPAN through conducting a preliminary performance evaluation of a commodity hardware environment, including Bluetooth Low Energy (BLE) network, Raspberry Pi, and a laptop PC. Our experimental results show that the power consumption of 6LoWPAN over BLE is one-tenth lower than that of IP over WiFi; the performance significantly depends on the distance between devices and the message size; and the communication completely stops when bursty traffic transfers. This observation provides our optimistic conclusions on the feasibility of 6LoWPAN although the maturity of implementations is a remaining issue.