Researcher profile

Jihyun Kang

Jihyun Kang contributes to research discovery and scholarly infrastructure.

Researcher; Affiliation not imported; Open to collaborate

Trust snapshot

Quick read

Trust 15 - Unverified
Verification L1
Unclaimed author
3 works
0 followers
4 topics
4 close collaborators

Published work

3 published items

preprint, 2026, arXiv

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.
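
As a quick sanity check on the figures quoted in this abstract, the following minimal sketch recomputes the recovery-rate ratio and the effective bandwidth implied by the stated utilization range. All constant and function names are hypothetical and introduced here for illustration; only the numeric values come from the abstract.

```python
# Figures quoted in the abstract; the names are illustrative, not the authors'.
AUTO_RETRY_SUCCESS_RATE = 0.333      # auto-retry chain success rate (12 chains)
MANUAL_RECOVERY_RATE = 0.125         # manual recovery rate
LINK_GBPS = 200.0                    # RoCE line rate in Gbps
UTILIZATION_RANGE = (0.014, 0.104)   # observed utilization, 1.4% to 10.4%
NODES, GPUS = 63, 504


def rate_ratio(auto: float, manual: float) -> float:
    """How many times higher the automatic success rate is than the manual one."""
    return auto / manual


def effective_gbps(link_gbps: float, utilization: float) -> float:
    """Throughput actually used on the link at a given utilization fraction."""
    return link_gbps * utilization


ratio = rate_ratio(AUTO_RETRY_SUCCESS_RATE, MANUAL_RECOVERY_RATE)
low_gbps, high_gbps = (effective_gbps(LINK_GBPS, u) for u in UTILIZATION_RANGE)
gpus_per_node = GPUS // NODES

print(f"auto vs. manual recovery: {ratio:.1f}x")
print(f"checkpoint traffic: {low_gbps:.1f} to {high_gbps:.1f} Gbps of {LINK_GBPS:.0f} Gbps")
print(f"GPUs per node: {gpus_per_node}")
```

The ratio rounds to the 2.7x stated in the abstract, and the utilization range corresponds to roughly 2.8 to 20.8 Gbps on a 200 Gbps link, which is the gap the abstract attributes to the NFS RPC layer rather than to the network.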

preprint, 2020, arXiv

JCMT POL-2 and BISTRO Survey observations of magnetic fields in the L1689 molecular cloud

We present 850$μ$m polarization observations of the L1689 molecular cloud, part of the nearby Ophiuchus molecular cloud complex, taken with the POL-2 polarimeter on the James Clerk Maxwell Telescope (JCMT). We observe three regions of L1689: the clump L1689N which houses the IRAS 16293-2422 protostellar system, the starless clump SMM-16, and the starless core L1689B. We use the Davis-Chandrasekhar-Fermi method to estimate plane-of-sky field strengths of $366\pm 55$ $μ$G in L1689N, $284\pm 34$ $μ$G in SMM-16, and $72\pm 33$ $μ$G in L1689B, for our fiducial value of dust opacity. These values indicate that all three regions are likely to be magnetically trans-critical with sub-Alfvénic turbulence. In all three regions, the inferred mean magnetic field direction is approximately perpendicular to the local filament direction identified in $Herschel$ Space Telescope observations. The core-scale field morphologies for L1689N and L1689B are consistent with the cloud-scale field morphology measured by the $Planck$ Space Observatory, suggesting that material can flow freely from large to small scales for these sources. Based on these magnetic field measurements, we posit that accretion from the cloud onto L1689N and L1689B may be magnetically regulated. However, in SMM-16, the clump-scale field is nearly perpendicular to the field seen on cloud scales by $Planck$, suggesting that it may be unable to efficiently accrete further material from its surroundings.
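
For readers unfamiliar with the Davis-Chandrasekhar-Fermi method named above, its commonly used form is sketched below. The abstract does not state which variant or correction factor the authors adopt, so the symbols and the value $Q \approx 0.5$ are the standard textbook choices and should be read as assumptions.

```latex
% Standard Davis-Chandrasekhar-Fermi estimate of the plane-of-sky field strength.
%   \rho            gas mass density
%   \sigma_v        non-thermal line-of-sight velocity dispersion
%   \sigma_\theta   dispersion of polarization angles (radians)
%   Q               correction factor, commonly taken as ~0.5 (assumed here)
\begin{equation}
  B_{\mathrm{pos}} = Q \sqrt{4\pi\rho}\, \frac{\sigma_v}{\sigma_\theta}
\end{equation}
% Widely quoted practical form, with n(H_2) in cm^{-3}, the FWHM line width
% \Delta v in km s^{-1}, and \sigma_\theta in degrees:
\begin{equation}
  B_{\mathrm{pos}} \approx 9.3 \sqrt{n(\mathrm{H_2})}\,
  \frac{\Delta v}{\sigma_\theta}\ \mu\mathrm{G}
\end{equation}
```

A smaller angle dispersion at fixed density and line width implies a stronger field, which is why well-ordered polarization vectors yield the larger field strengths.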

preprint, 2020, arXiv

Multiple outflows in the high-mass cluster forming region, G25.82-0.17

We present results of continuum and spectral line observations with ALMA and 22 GHz water (H$_2$O) maser observations using KaVA and VERA toward a high-mass star-forming region, G25.82-0.17. Multiple 1.3 mm continuum sources are revealed, indicating the presence of young stellar objects (YSOs) at different evolutionary stages, namely an ultra-compact HII region, G25.82-E, a high-mass young stellar object (HM-YSO), G25.82-W1, and starless cores, G25.82-W2 and G25.82-W3. Two SiO outflows, at N-S and SE-NW orientations, are identified. The CH$_3$OH 8$_{-1}$-7$_{0}$ E line, known to be a class I CH$_3$OH maser at 229 GHz, is also detected, showing a mixture of thermal and maser emission. Moreover, the H$_2$O masers are distributed in a region ~0.25" shifted from G25.82-W1. The CH$_3$OH 22$_{4}$-21$_{5}$ E line shows a compact ring-like structure at the position of G25.82-W1 with a velocity gradient, indicating a rotating disk or envelope. Assuming Keplerian rotation, the dynamical mass of G25.82-W1 is estimated to be $>$25 M$_{\odot}$, and a total mass of 20 M$_\odot$-84 M$_\odot$ is derived from the 1.3 mm continuum emission. The driving source of the N-S SiO outflow is G25.82-W1, while that of the SE-NW SiO outflow is uncertain. Detection of multiple high-mass starless/protostellar cores and candidates without low-mass cores implies that HM-YSOs could form in individual high-mass cores, as predicted by the turbulent core accretion model. If this is the case, the high-mass star formation process in G25.82 would be consistent with a scaled-up version of low-mass star formation.
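
The Keplerian dynamical mass quoted above follows from the standard relation sketched below. The abstract gives neither the radius nor the rotation velocity used, so the symbols here are generic, and the inclination treatment is the usual textbook argument rather than the authors' stated procedure.

```latex
% Enclosed mass for Keplerian rotation at radius r with rotation speed v_rot.
\begin{equation}
  M_{\mathrm{dyn}} = \frac{r\, v_{\mathrm{rot}}^{2}}{G}
\end{equation}
% Only the line-of-sight component v_los = v_rot \sin i is observed, where i is
% the inclination of the rotation axis to the line of sight. Since
% \sin^2 i \le 1, an unknown inclination turns the estimate into a lower limit:
\begin{equation}
  M_{\mathrm{dyn}} = \frac{r\, v_{\mathrm{los}}^{2}}{G \sin^{2} i}
  \;\ge\; \frac{r\, v_{\mathrm{los}}^{2}}{G}
\end{equation}
```

This is consistent with the abstract reporting the dynamical mass as a lower bound ($>$25 M$_{\odot}$) rather than a single value.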