Researcher profile

Mihailo Isakov

Mihailo Isakov contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2022arXiv

A Taxonomy of Error Sources in HPC I/O Machine Learning Models

I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of the system and the applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application and system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.

preprint2022arXiv

Zeno: A Scalable Capability-Based Secure Architecture

Despite the numerous efforts of security researchers, memory vulnerabilities remain a top issue for modern computing systems. Capability-based solutions aim to solve whole classes of memory vulnerabilities at the hardware level by encoding access permissions with each memory reference. While some capability systems have seen commercial adoption, little work has been done to apply a capability model to datacenter-scale systems. Cloud and high-performance computing often require programs to share memory across many compute nodes. This presents a challenge for existing capability models, as capabilities must be enforceable across multiple nodes. Each node must agree on what access permissions a capability has and overheads of remote memory access must remain manageable. To address these challenges, we introduce Zeno, a new capability-based architecture. Zeno supports a Namespace-based capability model to support globally shareable capabilities in a large-scale, multi-node system. In this work, we describe the Zeno architecture, define Zeno's security properties, evaluate the scalability of Zeno as a large-scale capability architecture, and measure the hardware overhead with an FPGA implementation.

preprint2020arXiv

NeuroFabric: Identifying Ideal Topologies for Training A Priori Sparse Networks

Long training times of deep neural networks are a bottleneck in machine learning research. The major impediment to fast training is the quadratic growth of both memory and compute requirements of dense and convolutional layers with respect to their information bandwidth. Recently, training `a priori' sparse networks has been proposed as a method for allowing layers to retain high information bandwidth, while keeping memory and compute low. However, the choice of which sparse topology should be used in these networks is unclear. In this work, we provide a theoretical foundation for the choice of intra-layer topology. First, we derive a new sparse neural network initialization scheme that allows us to explore the space of very deep sparse networks. Next, we evaluate several topologies and show that seemingly similar topologies can often have a large difference in attainable accuracy. To explain these differences, we develop a data-free heuristic that can evaluate a topology independently from the dataset the network will be trained on. We then derive a set of requirements that make a good topology, and arrive at a single topology that satisfies all of them.